Abstract

We propose a new procedure for estimating high dimensional Gaussian graphical 
models. Our approach is asymptotically tuning-free and non-asymptotically
tuning-insensitive: it requires very little effort to choose the tuning parameter in finite sample
settings. Computationally, our procedure is significantly faster than existing methods 
due to its tuning-insensitive property. Theoretically, the obtained estimator is simulta- 
neously minimax optimal for precision matrix estimation under different norms. Em- 
pirically, we illustrate the advantages of our method using thorough simulated and real 
examples. The R package bigmatrix implementing the proposed methods is available 
on the Comprehensive R Archive Network: http://cran.r-project.org/. 

1 Introduction 

We consider the problem of learning high dimensional Gaussian graphical models: let
$x_1, \ldots, x_n$ be $n$ data points from a $d$-dimensional random vector $X = (X_1, \ldots, X_d)^T$ with
$X \sim N_d(0, \Sigma)$. We want to estimate an undirected graph denoted by $G = (V, E)$, where
$V$ contains nodes corresponding to the $d$ variables in $X$, and the edge set $E$ describes the
conditional independence relationships between $X_1, \ldots, X_d$. Let $X_{\backslash\{j,k\}} := \{X_\ell : \ell \neq j, k\}$.
We say the joint distribution of $X$ is Markov to $G$ if $X_j$ is independent of $X_k$ given $X_{\backslash\{j,k\}}$
for all $(j, k) \notin E$. For Gaussian distributions, the graph $G$ is known to be encoded by the
precision matrix $\Theta := \Sigma^{-1}$. More specifically, no edge connects $X_j$ and $X_k$ if and only if
$\Theta_{jk} = 0$. The graph estimation problem is then reduced to the estimation of the precision
matrix $\Theta$. Such a problem is also called covariance selection (Dempster, 1972).
In low dimensions where $d < n$, Drton and Perlman (2007, 2008) develop a multiple
testing procedure for identifying the sparsity pattern of the precision matrix. In high
dimensions where $d \gg n$, Meinshausen and Bühlmann (2006) propose a neighborhood
pursuit approach for estimating Gaussian graphical models by solving a collection of sparse 
regression problems using the Lasso (Tibshirani, 1996; Chen et al., 1998). This approach can 
be viewed as a pseudo-likelihood approximation of the full likelihood. A related approach 
is to directly estimate $\Theta$ by penalizing the likelihood using the $L_1$-penalty (Banerjee et al.,
2008; Yuan and Lin, 2007; Friedman et al., 2008). To further reduce the estimation bias, 
Lam and Fan (2009); Jalali et al. (2012); Shen et al. (2012) propose either greedy algorithms 
or non-convex penalties for sparse precision matrix estimation. Under certain conditions, 
Ravikumar et al. (2011); Rothman et al. (2008) study the theoretical properties of the 
penalized likelihood methods. Yuan (2010) and Cai et al. (2011a) also propose the graphical 
Dantzig selector and CLIME respectively, which can be solved by linear programming and 
are more amenable to theoretical analysis than the penalized likelihood approach. More 
recently, Liu and Luo (2012) and Sun and Zhang (2012) propose the SCIO and scaled-Lasso 
methods, which estimate the sparse precision matrix in a column-by-column fashion and 
have good theoretical properties. 

Besides Gaussian graphical models, Liu et al. (2012) propose a semiparametric procedure
named nonparanormal skeptic, which extends the Gaussian family to the more
flexible semiparametric Gaussian copula family. Instead of assuming $X$ follows a Gaussian
distribution, they assume there exists a set of monotone functions $f_1, \ldots, f_d$ such that the
transformed data $f(X) := (f_1(X_1), \ldots, f_d(X_d))^T$ is Gaussian. More details can be found in
Liu et al. (2012) and Lafferty et al. (2012). Zhao et al. (2012) developed a scalable software
package to implement the nonparanormal algorithms. Other nonparametric graph estimation
methods include forest graphical models (Liu et al., 2011) and conditional graphical
models (Liu et al., 2010a).

Most of these methods require choosing some tuning parameters that control the bias-variance
tradeoff. Theoretical justifications of these methods are usually built on some
oracle choices of tuning parameters that cannot be implemented in practice. Choosing the
regularization parameter in a data-dependent way remains a challenging problem.
Popular techniques include the $C_p$-statistic (Mallows, 1973), AIC (Akaike information criterion,
Akaike (1973)), BIC (Bayesian information criterion, Schwarz (1978)), extended BIC
(Chen and Chen, 2008, 2012; Foygel and Drton, 2010), RIC (Risk inflation criterion, Foster
and George (1994)), cross validation (Efron, 1982), and covariance penalization (Efron,
2004). Most of these methods require data splitting and have only been justified for low
dimensional settings. Significant progress has been made recently on developing likelihood-free
regularization selection techniques, including permutation methods (Wu et al., 2007;
Boos et al., 2009; Lysen, 2009) and subsampling methods (Lange et al., 2004; Ben-david
et al., 2006; Meinshausen and Bühlmann, 2010; Bach, 2008). In particular, Meinshausen and
Bühlmann (2010), Bach (2008), and Liu et al. (2010b) propose to select the tuning parameters
using subsampling. However, these subsampling based methods are computationally
expensive and still lack theoretical guarantees.

In this paper we propose a new procedure for estimating high dimensional Gaussian
graphical models. Our method, named TIGER (Tuning-Insensitive Graph Estimation and
Regression), enjoys a "tuning-insensitive property": it automatically adapts to the unknown
sparsity pattern and is asymptotically tuning-free. In finite sample settings, we only need
to spend very little effort on tuning the regularization parameter. The main idea is to estimate
the precision matrix in a column-by-column fashion. For each column, the computation
is reduced to a sparse regression problem. This idea has been adopted by many methods,
including the neighborhood pursuit (Meinshausen and Bühlmann, 2006), graphical Dantzig
selector (Yuan, 2010), CLIME (Cai et al., 2011a), SCIO (Liu and Luo, 2012), and the
scaled-Lasso method (Sun and Zhang, 2012). These methods differ from each other mainly
by how they solve the sparse regression subproblem: the graphical Dantzig selector and
CLIME use the Dantzig selector, the SCIO and neighborhood pursuit use the Lasso, while
Sun and Zhang (2012) use the scaled-Lasso. Unlike these existing
methods, the TIGER solves this sparse regression problem using the SQRT-Lasso (Belloni
et al., 2012). By using the SQRT-Lasso regression, the TIGER gains the tuning-insensitive
property and improves upon existing methods both theoretically and empirically.

The main advantage of the TIGER over existing methods is its asymptotic tuning- 
free property, which allows us to use the entire dataset to efficiently learn and select the 
model. In contrast, it is well known that the cross-validation and subsampling methods 
are computationally expensive. Moreover, they may potentially waste valuable data which 
could otherwise be exploited to learn a better model (Bishop et al., 2003). With such a 
tuning-insensitive property, the TIGER also allows us to conduct more objective scientific 
data analysis. 

Another advantage of the TIGER is its computational simplicity and scalability for
large datasets. For problems with large dimensionality $d$, the TIGER divides the whole
problem into many subproblems; in each subproblem it estimates one column of the precision
matrix by solving a simple SQRT-Lasso problem. The final matrix estimator is formed
by combining the vector solutions into a matrix. This procedure can be carried out in a parallel
fashion and achieves a linear speedup with the number of CPU cores. An additional
performance improvement comes from the tuning-insensitive property of the TIGER. Most
existing methods exploit cross-validation to choose tuning parameters, which requires computing
the solutions over a full regularization path. In contrast, the TIGER solves the
SQRT-Lasso subproblem only for several fixed tuning parameters. For sparse problems,
this is significantly faster than existing methods.

In the current paper, we prove theoretical guarantees of the TIGER estimator $\hat\Theta$ of
the true precision matrix $\Theta$. In particular, let $\|\Theta\|_{\max} := \max_{jk} |\Theta_{jk}|$ and
$\|\Theta\|_1 := \max_j \sum_k |\Theta_{jk}|$. Under the assumption that the condition number of $\Theta$ is bounded by
a constant, we establish the elementwise sup-norm rate of convergence:

$$\|\hat\Theta - \Theta\|_{\max} = O_P\Big(\|\Theta\|_1 \sqrt{\frac{\log d}{n}}\Big). \tag{1}$$



If we further assume $\|\Theta\|_2 \le M_d$, where $M_d$ may scale with $d$, the obtained rate in (1)
is minimax optimal over the model class consisting of precision matrices with bounded 
condition numbers. This result allows us to effectively conduct graph estimation without 
the need of irrepresentable conditions. 

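Let $I(\cdot)$ be the indicator function and $s := \sum_{j \neq k} I(\Theta_{jk} \neq 0)$ be the number of nonzero
off-diagonal elements of $\Theta$. The result in (1) implies that the Frobenius norm error between
$\hat\Theta$ and $\Theta$ satisfies:

$$\|\hat\Theta - \Theta\|_F := \sqrt{\sum_{j,k} |\hat\Theta_{jk} - \Theta_{jk}|^2} = O_P\Big(\|\Theta\|_1 \sqrt{\frac{(s+d)\log d}{n}}\Big). \tag{2}$$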

Similarly, if we assume $\|\Theta\|_2 \le M_d$, where $M_d$ may scale with $d$, the rate in (2) is minimax
optimal for the Frobenius norm error in the same model class consisting of precision
matrices with bounded condition numbers.

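Let $\|\Theta\|_2$ be the largest eigenvalue of $\Theta$ (i.e., $\|\Theta\|_2$ is the spectral norm of $\Theta$) and
$k := \max_{i=1,\ldots,d} \sum_j I(\Theta_{ij} \neq 0)$. We also prove that

$$\|\hat\Theta - \Theta\|_2 \le \|\hat\Theta - \Theta\|_1 = O_P\Big(k \|\Theta\|_2 \sqrt{\frac{\log d}{n}}\Big). \tag{3}$$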



Under the same condition that $\|\Theta\|_1 \le M_d$, where $M_d$ may scale with $d$, this spectral norm
rate in (3) is also minimax optimal over the same model class as before. 

Besides these theoretical results, we also establish a relationship between the SQRT- 
Lasso and the scaled-Lasso proposed by Sun and Zhang (2012). More specifically, the 
objective function of the scaled-Lasso can be viewed as a variational upper bound of the 
SQRT-Lasso. This relationship allows us to develop a very efficient algorithm for the
TIGER.

Computationally, the TIGER is significantly faster than existing methods since very little
tuning is needed. In particular, we propose an iterative algorithm with initial values
searched by the Alternating Direction Method of Multipliers (ADMM). For each reduced
sparse regression subproblem, the computational complexity is of the same order as solving
one single Lasso with a sparse solution. Under the parallel computing framework, our
algorithm is even faster and more scalable than the glasso and huge packages (Friedman 
et al., 2008; Zhao et al., 2012). Empirically, we present thorough numerical simulations 
to compare the graph recovery and parameter estimation performance of our method with 
other approaches. A real data experiment on a gene expression dataset is also provided 
to back up our theory. The R package bigmatrix implementing the proposed methods is 
available on the Comprehensive R Archive Network: http://cran.r-project.org/. 

The rest of the paper is organized as follows. In Section 2, we introduce basic notation
and background on Gaussian graphical models. In Section 3, we present the TIGER estimator
and its computational algorithm. In Section 4, we present the theoretical properties
including the rates of convergence for parameter estimation and graph recovery. We also 
provide further discussions on the connections and differences of our results with other re- 
lated methods. In Section 5, we demonstrate its numerical performance through synthetic 
and real datasets. The proofs of the main results are given in the appendix. 



2 Background 

Let $x_1, \ldots, x_n$ be $n$ data points from a $d$-dimensional Gaussian random vector
$X := (X_1, \ldots, X_d)^T \sim N_d(0, \Sigma)$. We denote $x_i := (x_{i1}, \ldots, x_{id})^T$. As discussed in
the previous section, we define the precision matrix to be $\Theta := \Sigma^{-1}$. In this section, we start
with some notation, followed by an introduction to estimating Gaussian graphical models with a
column-by-column approach.

2.1 Notations 

Let $v := (v_1, \ldots, v_d)^T \in \mathbb{R}^d$ and $I(\cdot)$ be the indicator function. For $0 < q < \infty$, we define

$$\|v\|_q := \Big(\sum_{j=1}^d |v_j|^q\Big)^{1/q}. \tag{4}$$

We also define $\|v\|_0 := \sum_{j=1}^d I(v_j \neq 0)$ and $\|v\|_\infty := \max_j |v_j|$.

Let $A \in \mathbb{R}^{d \times d}$ be a symmetric matrix and $I, J \subset \{1, \ldots, d\}$ be two sets. We denote $A_{I,J}$
to be the submatrix of $A$ with rows and columns indexed by $I$ and $J$. Let $A_{*j}$ be the $j$th
column of $A$ and $A_{*\backslash j}$ be the submatrix of $A$ with the $j$th column $A_{*j}$ removed.

We define the following matrix norms: 

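$$\|A\|_q := \max_{\|v\|_q = 1} \|Av\|_q, \quad \|A\|_{\max} := \max_{jk} |A_{jk}|, \quad \text{and} \quad \|A\|_F := \Big(\sum_{j,k} |A_{jk}|^2\Big)^{1/2}. \tag{5}$$

It is easy to see that when $q = \infty$, $\|A\|_\infty = \|A\|_1$, since $A$ is symmetric. We also denote $\Lambda_{\max}(A)$ and $\Lambda_{\min}(A)$
to be the largest and smallest eigenvalues of $A$.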
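
As a quick numerical illustration of the norms in (5), the following sketch evaluates them with NumPy on a small symmetric matrix. The helper name `matrix_norms` and the example matrix are made up for this illustration and are not part of the paper or its software.

```python
import numpy as np

def matrix_norms(A):
    """Evaluate the matrix norms of (5) for a square matrix A."""
    return {
        "L1": np.abs(A).sum(axis=0).max(),     # induced 1-norm: largest absolute column sum
        "Linf": np.abs(A).sum(axis=1).max(),   # induced inf-norm: largest absolute row sum
        "spectral": np.linalg.norm(A, 2),      # induced 2-norm: largest singular value
        "max": np.abs(A).max(),                # elementwise sup-norm
        "frobenius": np.sqrt((np.abs(A) ** 2).sum()),
    }

# A small symmetric positive definite matrix (hypothetical example).
A = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])
norms = matrix_norms(A)
eigs = np.linalg.eigvalsh(A)  # eigenvalues in ascending order
```

For a symmetric matrix the row and column sums coincide, so the two induced norms agree, and for a positive definite matrix the spectral norm equals the largest eigenvalue.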
2.2 Gaussian Graphical Model and Column-by-Column Regression

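Let $X \sim N_d(0, \Sigma)$. The conditional distribution of $X_j$ given $X_{\backslash j}$ satisfies

$$X_j \mid X_{\backslash j} \sim N\Big(\Sigma_{\backslash j,j}^T (\Sigma_{\backslash j,\backslash j})^{-1} X_{\backslash j},\; \Sigma_{jj} - \Sigma_{\backslash j,j}^T (\Sigma_{\backslash j,\backslash j})^{-1} \Sigma_{\backslash j,j}\Big). \tag{6}$$

Let $\alpha_j := (\Sigma_{\backslash j,\backslash j})^{-1} \Sigma_{\backslash j,j} \in \mathbb{R}^{d-1}$ and $\sigma_j^2 := \Sigma_{jj} - \Sigma_{\backslash j,j}^T (\Sigma_{\backslash j,\backslash j})^{-1} \Sigma_{\backslash j,j}$. We have

$$X_j = \alpha_j^T X_{\backslash j} + \epsilon_j, \tag{7}$$

where $\epsilon_j \sim N(0, \sigma_j^2)$ is independent of $X_{\backslash j}$. By the block matrix inversion formula, we
have

$$\Theta_{jj} = (\mathrm{Var}(\epsilon_j))^{-1} = \sigma_j^{-2}, \tag{8}$$
$$\Theta_{\backslash j,j} = -(\mathrm{Var}(\epsilon_j))^{-1} \alpha_j = -\sigma_j^{-2} \alpha_j. \tag{9}$$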

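Therefore, we can recover $\Theta$ in a column-by-column manner by regressing $X_j$ on $X_{\backslash j}$ for
$j = 1, 2, \ldots, d$. For example, let

$$\mathbf{X} := \begin{pmatrix} x_{11} & \cdots & x_{1d} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{nd} \end{pmatrix} \in \mathbb{R}^{n \times d} \tag{10}$$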



be the data matrix. We denote $\alpha_j := (\alpha_{j1}, \ldots, \alpha_{j(d-1)})^T \in \mathbb{R}^{d-1}$. Meinshausen and
Bühlmann (2006) propose to iteratively estimate each $\alpha_j$ by solving the Lasso regression:

$$\hat\alpha_j = \mathop{\mathrm{argmin}}_{\alpha_j \in \mathbb{R}^{d-1}} \frac{1}{2n} \|\mathbf{X}_{*j} - \mathbf{X}_{*\backslash j} \alpha_j\|_2^2 + \lambda_j \|\alpha_j\|_1, \tag{11}$$

where $\lambda_j$ is a tuning parameter. Once $\hat\alpha_j$ is given, we get the neighborhood edges by
reading out the nonzero coefficients of $\hat\alpha_j$. The final graph estimate $\hat G$ is obtained by either
the "AND" or "OR" rule on combining the neighborhoods for all the $d$ nodes. However,
the neighborhood pursuit method of Meinshausen and Bühlmann (2006) only estimates the
graph $G$ and does not estimate the inverse covariance matrix $\Theta$.
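
The identities (8) and (9) behind the column-by-column approach can be checked numerically at the population level. The sketch below is only an illustration: the covariance matrix is randomly generated for the demo, and the function name `column_from_regression` is hypothetical. It rebuilds each column of the precision matrix from the regression coefficients and the residual variance, then compares the result with the direct inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
B = rng.standard_normal((d, d))
Sigma = B @ B.T + d * np.eye(d)   # a positive definite covariance, made up for the demo
Theta = np.linalg.inv(Sigma)      # the precision matrix computed directly

def column_from_regression(Sigma, j):
    """Rebuild column j of the precision matrix from the regression of X_j on the other variables."""
    rest = [k for k in range(Sigma.shape[0]) if k != j]
    S_rr = Sigma[np.ix_(rest, rest)]
    S_rj = Sigma[rest, j]
    alpha = np.linalg.solve(S_rr, S_rj)   # population regression coefficients alpha_j
    sigma2 = Sigma[j, j] - S_rj @ alpha   # residual variance sigma_j^2
    col = np.empty(Sigma.shape[0])
    col[j] = 1.0 / sigma2                 # identity (8)
    col[rest] = -alpha / sigma2           # identity (9)
    return col

Theta_rebuilt = np.column_stack([column_from_regression(Sigma, j) for j in range(d)])
```

In practice the population quantities are unknown, which is why the methods discussed in this section replace the exact regression by a sparse regression on the data matrix.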

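To explicitly estimate $\Theta$, Yuan (2010) proposes to estimate $\alpha_j$ by solving the Dantzig
selector:

$$\hat\alpha_j = \mathop{\mathrm{argmin}}_{\alpha_j \in \mathbb{R}^{d-1}} \|\alpha_j\|_1 \quad \text{subject to} \quad \|\hat\Sigma_{\backslash j,j} - \hat\Sigma_{\backslash j,\backslash j} \alpha_j\|_\infty \le \gamma_j, \tag{12}$$
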
where $\hat\Sigma := \frac{1}{n} \mathbf{X}^T \mathbf{X}$ is the sample covariance matrix and $\gamma_j$ is a tuning parameter. Once $\hat\alpha_j$
is given, we can estimate $\sigma_j^2$ by

$$\hat\sigma_j^2 = 1 - 2\hat\alpha_j^T \hat\Sigma_{\backslash j,j} + \hat\alpha_j^T \hat\Sigma_{\backslash j,\backslash j} \hat\alpha_j. \tag{13}$$

We then get the estimate $\hat\Theta$ of $\Theta$ by plugging $\hat\alpha_j$ and $\hat\sigma_j^2$ into (8) and (9). Yuan (2010)
analyzes the $L_1$-norm error $\|\hat\Theta - \Theta\|_1$ and shows its minimax optimality over a certain model
space. However, no graph estimation result is provided for this approach.

In another work, Sun and Zhang (2012) propose to estimate $\alpha_j$ and $\sigma_j$ by solving a
scaled-Lasso problem:

$$\hat b_j, \hat\sigma_j = \mathop{\mathrm{argmin}}_{b = (b_1, \ldots, b_d)^T,\, \sigma} \Big\{ \frac{b^T \hat\Sigma b}{2\sigma} + \frac{\sigma}{2} + \lambda \sum_{k=1}^d \hat\Sigma_{kk}^{1/2} |b_k| \quad \text{subject to} \quad b_j = -1 \Big\}. \tag{14}$$

Once $\hat b_j$ is obtained, $\hat\alpha_j = \hat b_{\backslash j}$. Sun and Zhang (2012) analyze the spectral-norm rate
of convergence of the obtained precision matrix estimator. They did not investigate the
elementwise sup-norm and graph recovery performance. In the next section, we will show
that the scaled-Lasso estimator is closely related to our proposed procedure and will discuss
the relationship in more detail.

To estimate both the precision matrix $\Theta$ and graph $G$, Cai et al. (2011a) propose the
CLIME estimator, which directly estimates the $j$th column of $\Theta$ by solving

$$\hat\Theta_{*j} = \mathop{\mathrm{argmin}}_{\Theta_{*j}} \|\Theta_{*j}\|_1 \quad \text{subject to} \quad \|\hat\Sigma \Theta_{*j} - e_j\|_\infty \le \delta_j, \quad \text{for } j = 1, \ldots, d, \tag{15}$$

where $e_j$ is the $j$th canonical vector and $\delta_j$ is a tuning parameter. Cai et al. (2011a)
show that this convex optimization can be formulated as a linear program and has the
potential to scale to large problems. Once $\hat\Theta$ is obtained, we use another tuning parameter
$\tau$ to threshold $\hat\Theta$ to estimate the graph $G$. In a follow-up work of the CLIME, Liu and Luo
(2012) propose the SCIO estimator, which solves the $j$th column of $\Theta$ by

$$\hat\Theta_{*j} = \mathop{\mathrm{argmin}}_{\Theta_{*j}} \Big\{ \frac{1}{2} \Theta_{*j}^T \hat\Sigma \Theta_{*j} - e_j^T \Theta_{*j} + \lambda_j \|\Theta_{*j}\|_1 \Big\}. \tag{16}$$

The justifications of most of these graph estimation methods are built on some theoretical
choices of tuning parameters that cannot be implemented in practice. For example, in the
neighborhood pursuit method and the graphical Dantzig selector, the tuning parameters $\lambda_j$
and $\gamma_j$ depend on $\sigma_j^2$, which is unknown. Practically, we usually set $\lambda = \lambda_1 = \cdots = \lambda_d$
and $\gamma = \gamma_1 = \cdots = \gamma_d$ to reduce the number of tuning parameters. However, as we
will illustrate in later sections, such a choice makes the estimating procedure non-adaptive
to inhomogeneous graphs. The tuning parameters of the CLIME and SCIO depend on
$\|\Theta\|_1$, which is unknown. In general, these methods employ cross validation to conduct
data-dependent tuning parameter selection and Liu and Luo (2012) provide theoretical
analysis of the cross-validation estimator. However, as we discussed before, cross-validation
is computationally expensive and wastes valuable training data. In the next section,
we will describe a tuning-insensitive procedure that simultaneously estimates the precision
matrix $\Theta$ and graph $G$ with the optimal rates of convergence.
3 Method



In this section we introduce the use of the SQRT-Lasso from Belloni et al. (2012) for
simultaneously estimating the graph $G$ and precision matrix $\Theta := \Sigma^{-1}$.

The SQRT-Lasso is a penalized optimization algorithm for solving high dimensional
linear regression problems. Consider a linear regression problem $y = X\beta + \epsilon$, where $y \in \mathbb{R}^n$ is
the response, $X \in \mathbb{R}^{n \times d}$ is the design matrix, $\beta \in \mathbb{R}^d$ is the vector of unknown coefficients,
and $\epsilon \in \mathbb{R}^n$ is the noise vector. The SQRT-Lasso estimates $\beta$ by solving

$$\hat\beta = \mathop{\mathrm{argmin}}_{\beta \in \mathbb{R}^d} \Big\{ \frac{1}{\sqrt{n}} \|y - X\beta\|_2 + \lambda \|\beta\|_1 \Big\}, \tag{17}$$

where $\lambda$ is the tuning parameter. It is shown in Belloni et al. (2012) that the choice of $\lambda$ for
the SQRT-Lasso method is asymptotically universal and does not depend on any unknown
parameter. In contrast, most other methods, including the Lasso and Dantzig selector,
rely heavily on a known standard deviation of the noise. Moreover, the SQRT-Lasso method
achieves near oracle performance for the estimation of $\beta$.

3.1 TIGER for Graph and Precision Matrix Estimation 

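In the discussion of this section, we always condition on the observed data $x_1, \ldots, x_n$. Let
$\hat\Gamma := \mathrm{diag}(\hat\Sigma)$ be a $d$-dimensional diagonal matrix whose diagonal elements are the same
as those in $\hat\Sigma$. We define

$$Z := (Z_1, \ldots, Z_d)^T = \hat\Gamma^{-1/2} X. \tag{18}$$

By (7), we have

$$Z_j \hat\Gamma_{jj}^{1/2} = \alpha_j^T \hat\Gamma_{\backslash j,\backslash j}^{1/2} Z_{\backslash j} + \epsilon_j. \tag{19}$$

We define

$$\beta_j := \hat\Gamma_{\backslash j,\backslash j}^{1/2} \hat\Gamma_{jj}^{-1/2} \alpha_j \quad \text{and} \quad \tau_j^2 := \sigma_j^2 \hat\Gamma_{jj}^{-1}. \tag{20}$$

Therefore, we have

$$Z_j = \beta_j^T Z_{\backslash j} + \hat\Gamma_{jj}^{-1/2} \epsilon_j. \tag{21}$$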

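We define $\hat R$ to be the sample correlation matrix: $\hat R := (\mathrm{diag}(\hat\Sigma))^{-1/2} \hat\Sigma (\mathrm{diag}(\hat\Sigma))^{-1/2}$.
Motivated by the model in (21), we propose the following precision matrix estimator.

For $j = 1, \ldots, d$, we estimate the $j$th column of $\Theta$ by solving:

$$\hat\beta_j := \mathop{\mathrm{argmin}}_{\beta_j \in \mathbb{R}^{d-1}} \Big\{ \sqrt{1 - 2\beta_j^T \hat R_{\backslash j,j} + \beta_j^T \hat R_{\backslash j,\backslash j} \beta_j} + \lambda \|\beta_j\|_1 \Big\}, \tag{22}$$
$$\hat\tau_j := \sqrt{1 - 2\hat\beta_j^T \hat R_{\backslash j,j} + \hat\beta_j^T \hat R_{\backslash j,\backslash j} \hat\beta_j}, \tag{23}$$
$$\hat\Theta_{jj} = \hat\tau_j^{-2} \hat\Gamma_{jj}^{-1} \quad \text{and} \quad \hat\Theta_{\backslash j,j} = -\hat\tau_j^{-2} \hat\Gamma_{jj}^{-1/2} \hat\Gamma_{\backslash j,\backslash j}^{-1/2} \hat\beta_j. \tag{24}$$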
For the estimator in (22), $\lambda$ is a tuning parameter. In the next section, we show that by
choosing $\lambda = \pi \sqrt{\frac{\log d}{2n}}$, the obtained estimator achieves the optimal rates of convergence in
the asymptotic setting. Therefore, our procedure is asymptotically tuning-parameter free.
For finite samples, we set

$$\lambda := \zeta \pi \sqrt{\frac{\log d}{2n}}, \tag{25}$$

and $\zeta$ can be chosen from a range $[\sqrt{2}/\pi, 1]$. Since the choice of $\zeta$ does not depend on any
unknown parameters or quantities, we call the procedure tuning-insensitive. Empirically,
we found that in most cases we encountered, we can simply set $\zeta = \sqrt{2}/\pi$ and the resulting
estimator works well in finite sample settings. More details can be found in the simulation
section.

If a symmetric precision matrix estimate is preferred, we conduct the following correction:
$\hat\Theta_{jk} \leftarrow \min\{\hat\Theta_{jk}, \hat\Theta_{kj}\}$ for all $k \neq j$. Another symmetrization method is

$$\hat\Theta \leftarrow \frac{\hat\Theta + \hat\Theta^T}{2}. \tag{26}$$

As has been shown by Cai et al. (2011a), if $\hat\Theta$ is a good estimator, then its symmetrized version will also be a
good estimator: they achieve the same rates of convergence in the asymptotic settings.

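Let $Z \in \mathbb{R}^{n \times d}$ be the normalized data matrix, i.e., $Z_{*j} = \mathbf{X}_{*j} \hat\Sigma_{jj}^{-1/2}$ for $j = 1, \ldots, d$. An
equivalent form of (22) is

$$\hat\beta_j = \mathop{\mathrm{argmin}}_{\beta_j \in \mathbb{R}^{d-1}} \Big\{ \frac{1}{\sqrt{n}} \|Z_{*j} - Z_{*\backslash j} \beta_j\|_2 + \lambda \|\beta_j\|_1 \Big\}, \tag{27}$$
$$\hat\tau_j = \frac{1}{\sqrt{n}} \|Z_{*j} - Z_{*\backslash j} \hat\beta_j\|_2. \tag{28}$$

Once we have $\hat\Theta$, the estimated graph is $\hat G := (V, \hat E)$, where $(j, k) \in \hat E$ if and only if
$\hat\Theta_{jk} \neq 0$.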
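
The two symmetrization rules and the read-out of the edge set can be sketched in a few lines of NumPy. This is only an illustration: the function names and the small asymmetric matrix are hypothetical, and the first rule is implemented literally as the elementwise minimum stated in the text.

```python
import numpy as np

def symmetrize_min(Theta_hat):
    """Elementwise rule from the text: replace each pair by min{Theta_jk, Theta_kj}."""
    return np.minimum(Theta_hat, Theta_hat.T)

def symmetrize_average(Theta_hat):
    """Averaging rule (26): (Theta + Theta^T) / 2."""
    return 0.5 * (Theta_hat + Theta_hat.T)

def edge_set(Theta_sym, tol=0.0):
    """Edges (j, k), j < k, whose off-diagonal entry is nonzero (larger than tol in magnitude)."""
    d = Theta_sym.shape[0]
    return {(j, k) for j in range(d) for k in range(j + 1, d)
            if abs(Theta_sym[j, k]) > tol}

# A hypothetical column-by-column estimate that is not symmetric.
Theta_hat = np.array([[1.0, -0.3, 0.0],
                      [-0.2, 1.5, 0.0],
                      [0.0, 0.4, 2.0]])
edges_avg = edge_set(symmetrize_average(Theta_hat))
edges_min = edge_set(symmetrize_min(Theta_hat))
```

The two rules can return different graphs when one of the paired entries is exactly zero, as happens for the pair (1, 2) in this toy matrix.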

3.2 Relationship with the Scaled-Lasso Estimator 

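In this subsection, we show that the scaled-Lasso from Sun and Zhang (2012) can be
viewed as a variational upper bound of the objective function of the SQRT-Lasso and that they
solve the same problem.¹

More specifically, we consider the following optimization:

$$\hat\beta_j, \hat\tau_j = \mathop{\mathrm{argmin}}_{\beta_j \in \mathbb{R}^{d-1},\, \tau_j \ge 0} \Big\{ \frac{1 - 2\beta_j^T \hat R_{\backslash j,j} + \beta_j^T \hat R_{\backslash j,\backslash j} \beta_j}{2\tau_j} + \frac{\tau_j}{2} + \lambda \|\beta_j\|_1 \Big\}. \tag{29}$$

¹ This relationship is proposed and kindly provided by Professor Cun-hui Zhang.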
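Proposition 1. The optimizations in (22) and (29) provide the same solution $\hat\beta_j$.

Proof. For any $a, b > 0$, we have $a^2 + b^2 \ge 2ab$ and the equality is attained if and only if
$a = b$. Taking $a^2 = (1 - 2\beta_j^T \hat R_{\backslash j,j} + \beta_j^T \hat R_{\backslash j,\backslash j} \beta_j)/\tau_j$ and $b^2 = \tau_j$, we have

$$\frac{1 - 2\beta_j^T \hat R_{\backslash j,j} + \beta_j^T \hat R_{\backslash j,\backslash j} \beta_j}{2\tau_j} + \frac{\tau_j}{2} \ge \sqrt{1 - 2\beta_j^T \hat R_{\backslash j,j} + \beta_j^T \hat R_{\backslash j,\backslash j} \beta_j}. \tag{30}$$

This shows that the objective function in (29) is an upper bound of the objective function
in (22). The equality is attained if and only if

$$\tau_j = \sqrt{1 - 2\beta_j^T \hat R_{\backslash j,j} + \beta_j^T \hat R_{\backslash j,\backslash j} \beta_j}. \tag{31}$$

We finish the proof. □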

This relationship between the TIGER and scaled-Lasso provides an efficient algorithm 
as described in the next subsection. 

3.3 Computational Algorithm 

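The objective in (29) is jointly convex with respect to $\beta_j$ and $\tau_j$ and can be minimized by a
coordinate-descent procedure. In the $t$-th iteration, for a given $\tau_j^{(t)}$, we first solve the subproblem

$$\beta_j^{(t+1)} := \mathop{\mathrm{argmin}}_{\beta_j \in \mathbb{R}^{d-1}} \Big\{ \frac{1 - 2\beta_j^T \hat R_{\backslash j,j} + \beta_j^T \hat R_{\backslash j,\backslash j} \beta_j}{2\tau_j^{(t)}} + \lambda \|\beta_j\|_1 \Big\}. \tag{32}$$

This is a Lasso problem and can be efficiently solved by the coordinate descent algorithm
developed by Friedman et al. (2007). Once $\beta_j^{(t+1)}$ is obtained, we can calculate $\tau_j^{(t+1)}$ as

$$\tau_j^{(t+1)} = \sqrt{1 - 2(\beta_j^{(t+1)})^T \hat R_{\backslash j,j} + (\beta_j^{(t+1)})^T \hat R_{\backslash j,\backslash j} (\beta_j^{(t+1)})}. \tag{33}$$

We iterate these two steps until the algorithm converges.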
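
The alternating scheme in (32) and (33) can be sketched as follows. This is a minimal illustration, not the implementation in the authors' package: the function names, the stopping tolerances, and the AR(1)-type correlation matrix are assumptions made for the example. Multiplying the objective in (32) by $\tau_j^{(t)}$ and dropping the constant shows that the subproblem is a Lasso with penalty $\lambda \tau_j^{(t)}$, which the sketch solves by plain coordinate descent.

```python
import numpy as np

def soft_threshold(x, t):
    """Scalar soft-thresholding operator."""
    return np.sign(x) * max(abs(x) - t, 0.0)

def lasso_cd(R, r, penalty, beta, n_sweeps=200, tol=1e-10):
    """Coordinate descent for min_b 0.5*b'Rb - b'r + penalty*||b||_1 (beta is updated in place)."""
    for _ in range(n_sweeps):
        max_change = 0.0
        for k in range(len(beta)):
            partial = r[k] - R[k] @ beta + R[k, k] * beta[k]  # leave coordinate k out
            new = soft_threshold(partial, penalty) / R[k, k]
            max_change = max(max_change, abs(new - beta[k]))
            beta[k] = new
        if max_change < tol:
            break
    return beta

def sqrt_lasso_column(R_full, j, lam, tau0=1.0, n_outer=100, tol=1e-8):
    """Alternate the Lasso step (32) and the closed-form tau update (33) for column j."""
    rest = [k for k in range(R_full.shape[0]) if k != j]
    R = R_full[np.ix_(rest, rest)]
    r = R_full[rest, j]
    beta, tau = np.zeros(len(rest)), tau0
    for _ in range(n_outer):
        beta = lasso_cd(R, r, lam * tau, beta)  # step (32), rescaled by tau
        quad = 1.0 - 2.0 * beta @ r + beta @ R @ beta
        tau_new = np.sqrt(max(quad, 1e-12))     # step (33), guarded against round-off
        converged = abs(tau_new - tau) < tol
        tau = tau_new
        if converged:
            break
    return beta, tau

# Hypothetical correlation matrix with entries 0.5^|i-j| (d = 4), used only for the demo.
idx = np.arange(4)
R_full = 0.5 ** np.abs(np.subtract.outer(idx, idx))
beta_hat, tau_hat = sqrt_lasso_column(R_full, j=0, lam=0.1)
```

In this toy example only the first neighbor of node 0 is selected, and the pair `(beta_hat, tau_hat)` satisfies the relation (23) at convergence.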

One thing to note is that the above algorithm converges fast if a good initial value of
$\tau_j$ is provided. For example, if the initial value $\tau_j^{(0)}$ is close to $\hat\tau_j$, the algorithm converges in only 3 to 5
iterations. In fact, with a good initial value of $\tau_j$, the computation is roughly the same as
running the Lasso for one single tuning parameter with a sparse solution. However, with a
bad initial value of $\tau_j$, the computational complexity can be as heavy as calculating the full
regularization path of a Lasso problem, which is less efficient.

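To obtain a good initial estimate of $\tau_j$, we propose an augmented Lagrangian method as
follows. We first reparameterize (27) and (28) as

$$\hat\beta_j, \hat\gamma = \mathop{\mathrm{argmin}}_{\beta_j \in \mathbb{R}^{d-1},\, \gamma \in \mathbb{R}^n} \Big\{ \frac{1}{\sqrt{n}} \|\gamma\|_2 + \lambda \|\beta_j\|_1 \Big\} \quad \text{subject to} \quad \gamma = Z_{*j} - Z_{*\backslash j} \beta_j, \tag{34}$$
$$\hat\tau_j = \frac{1}{\sqrt{n}} \|\hat\gamma\|_2. \tag{35}$$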
We consider the following augmented Lagrangian function

$$\mathcal{L}(\beta_j, \gamma, \alpha) := \frac{1}{\sqrt{n}} \|\gamma\|_2 + \lambda \|\beta_j\|_1 + \rho \langle \alpha, \gamma - Z_{*j} + Z_{*\backslash j} \beta_j \rangle + \frac{\rho}{2} \|\gamma - Z_{*j} + Z_{*\backslash j} \beta_j\|_2^2, \tag{36}$$

where $\rho > 0$ is the penalty parameter for the violation of the linear constraints. For simplicity,
throughout this paper we assume that $\rho > 0$ is fixed (in implementations, we simply set
$\rho = 1$). $\alpha \in \mathbb{R}^n$ is the Lagrange multiplier vector, rescaled by $\rho$ for computational and
notational convenience. This reparametrization decouples the computational dependency
in the optimization problem. Therefore a complicated problem can be split into multiple
simpler sub-problems, each of which can be solved easily.

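The augmented Lagrangian method works in an iterative fashion. Suppose we have the
solution $\beta_j^{(t)}, \gamma^{(t)}, \alpha^{(t)}$ at the $t$-th iteration. The algorithm proceeds as follows:
Step 1. Update $\beta_j$ by

$$\beta_j^{(t+1)} = \mathop{\mathrm{argmin}}_{\beta_j} \Big\{ \lambda \|\beta_j\|_1 + \frac{\rho}{2} \|\alpha^{(t)} + \gamma^{(t)} - Z_{*j} + Z_{*\backslash j} \beta_j\|_2^2 \Big\}. \tag{37}$$

Let $u^{(t)} := Z_{*j} - \alpha^{(t)} - \gamma^{(t)}$ and $\lambda_\rho := \lambda/\rho$. The problem in (37) is equivalent to

$$\beta_j^{(t+1)} = \mathop{\mathrm{argmin}}_{\beta_j} \Big\{ \lambda_\rho \|\beta_j\|_1 + \frac{1}{2} \|u^{(t)} - Z_{*\backslash j} \beta_j\|_2^2 \Big\}. \tag{38}$$

This is a Lasso subproblem which can be efficiently solved by the coordinate descent algorithm
(Friedman et al., 2007).
Step 2. Given $\beta_j^{(t+1)}$, we then update $\gamma$ by

$$\gamma^{(t+1)} = \mathop{\mathrm{argmin}}_{\gamma} \Big\{ \frac{1}{\sqrt{n}} \|\gamma\|_2 + \frac{\rho}{2} \|\gamma - Z_{*j} + Z_{*\backslash j} \beta_j^{(t+1)} + \alpha^{(t)}\|_2^2 \Big\}. \tag{39}$$

The problem (39) has the closed-form solution by soft-thresholding,

$$\gamma^{(t+1)} = \big(Z_{*j} - Z_{*\backslash j} \beta_j^{(t+1)} - \alpha^{(t)}\big) \cdot \Big(1 - \frac{1}{\rho \sqrt{n}\, \|Z_{*j} - Z_{*\backslash j} \beta_j^{(t+1)} - \alpha^{(t)}\|_2}\Big)_+, \tag{40}$$

where $(x)_+ := \max\{0, x\}$ is the hinge function.

Step 3. Given $\gamma^{(t+1)}$ and $\beta_j^{(t+1)}$, we update the Lagrange multiplier $\alpha$ by

$$\alpha^{(t+1)} = \alpha^{(t)} + \gamma^{(t+1)} - Z_{*j} + Z_{*\backslash j} \beta_j^{(t+1)}. \tag{41}$$

Since we have rescaled the Lagrange multiplier $\alpha$ by $\rho$, the updating equation for $\alpha$ is
independent of $\rho$.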
The algorithm stops when the following convergence criterion is satisfied 



™ J „^„ J , J „^„ <^ ( 42 ) 



2 



11 



where $\epsilon > 0$ is a precision tolerance parameter. Theoretically, we can set $\epsilon$ to be a very
small value, e.g. $\epsilon = 10^{-6}$. This directly solves for $\hat\beta_j$. However, empirically, we found it is
more efficient to set $\epsilon = 10^{-3}$ and only obtain a crude initial estimate $\hat\tau_j^{\mathrm{crude}}$. We then use
$\hat\tau_j^{\mathrm{crude}}$ as the initial value and alternately solve (32) and (33). Such a hybrid procedure
delivers the best empirical performance.

3.4 Fine-tune the Regularization Parameter 

To secure the best finite sample performance, we could also fine-tune the $\zeta$ in (25) on a small
interval $[\sqrt{2}/\pi, 1]$. Practically, due to the tuning-insensitive property of our procedure, we
find it suffices to only cross-validate 3 values and pick the best one: $\zeta \in \{\sqrt{2}/\pi, 0.6, 1\}$.
In general, all these three values guarantee that the solutions are relatively sparse, so the
algorithm runs very efficiently.



4 Theoretical Properties 

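In this section we investigate the theoretical properties of the proposed method. We begin
with some notation and assumptions. We define a matrix class $\mathcal{M}(\xi_{\max}, k)$:

$$\mathcal{M}(\xi_{\max}, k) := \Big\{ \Theta = \Theta^T \in \mathbb{R}^{d \times d} : \Theta \succ 0,\ \frac{\Lambda_{\max}(\Theta)}{\Lambda_{\min}(\Theta)} \le \xi_{\max},\ \max_j \sum_i I(\Theta_{ij} \neq 0) \le k \Big\}, \tag{43}$$

where $\xi_{\max}$ is a constant and $k$ may scale with the sample size $n$. We first list three
required assumptions:

(A1) $\Theta \in \mathcal{M}(\xi_{\max}, k)$,

(A2) $k^2 \log d = o(n)$,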

(A3) $\limsup_{n \to \infty} \max_{1 \le j \le d} \Sigma_{jj}^2 \frac{\log d}{n} < \frac{1}{4}$.

All these assumptions are mild. Assumption (A1) only requires the inverse covariance
matrix $\Theta := \Sigma^{-1}$ to have a bounded condition number. Assumption (A2) is equivalent to

$$\lim_{n \to \infty} k \sqrt{\frac{\log d}{n}} = 0. \tag{44}$$

In later analysis, we will show that this condition is necessary to secure the consistency of
the precision matrix estimation under different matrix norms. Assumption (A3) requires
that the marginal variance of $X_j$ should not diverge too fast.
4.1 Precision Matrix Estimation Consistency

We study the estimation error $\hat\Theta - \Theta$ of the precision matrix under different norms, including
spectral norm, matrix $L_1$-norm, elementwise sup-norm, and Frobenius norm. The rate
under the elementwise sup-norm is important for graph recovery. The rate under the
spectral norm is important since it leads to the consistency of the estimation of eigenvalues
and eigenvectors, which may further be utilized to analyze the theoretical properties of
downstream statistical inference. We analyze the rate of spectral norm by bounding the
$L_1$-norm rate. The rate under the Frobenius norm is also a useful measure of the accuracy of the
estimation of $\Theta$ and can be viewed as the sum of squared errors for estimating individual
rows. Our main results indicate that the TIGER procedure simultaneously achieves the
optimal rates of convergence under all these different matrix norms. We present these
results in two main theorems and compare our results with related work in the literature.
The proofs of these theorems can be found in the appendix.

Theorem 2 provides the rates of convergence under the matrix $L_1$ and spectral norms.

Theorem 2 ($L_1$ and spectral norm rates). We choose the regularization parameter $\lambda$ as in
(25) with $\zeta = 1$. Under Assumptions (A1), (A2), and (A3), we have

$$\|\hat\Theta - \Theta\|_1 = O_P\Big(k \|\Theta\|_2 \sqrt{\frac{\log d}{n}}\Big), \tag{45}$$
$$\|\hat\Theta - \Theta\|_2 = O_P\Big(k \|\Theta\|_2 \sqrt{\frac{\log d}{n}}\Big). \tag{46}$$

Proof. The proof of (45) can be found in Appendix D.1. The proof of (46) follows from
the inequality $\|\hat\Theta - \Theta\|_2 \le \|\hat\Theta - \Theta\|_1$. □

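If we further assume $\|\Theta\|_1 \le M_d$, where $M_d$ may scale with $d$, i.e., we define the following
new matrix class $\mathcal{M}(M_d, \xi_{\max}, k)$:

$$\mathcal{M}(M_d, \xi_{\max}, k) := \big\{ \Theta \in \mathcal{M}(\xi_{\max}, k) : \|\Theta\|_1 \le M_d \big\}. \tag{47}$$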

The result of Theorem 2 implies that 



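$$\sup_{\Theta \in \mathcal{M}(M_d, \xi_{\max}, k)} \|\hat\Theta - \Theta\|_2 = O_P\Big(k M_d \sqrt{\frac{\log d}{n}}\Big). \tag{48}$$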




Based on the results of Cai et al. (2011b), Liu and Luo (2012), and Yuan (2010), this rate
of convergence is minimax optimal on model class $\mathcal{M}(M_d, \xi_{\max}, k)$.

The next theorem provides the rates of convergence under the elementwise sup-norm
and Frobenius norm. The elementwise sup-norm result is useful for the graph recovery from
the precision matrix $\Theta$.
Theorem 3 (Elementwise sup-norm and Frobenius-norm rates). We choose the regularization
parameter $\lambda$ as in (25) with $\zeta = 1$. Let $s$ be the total number of nonzero off-diagonal
elements of $\Theta$. Under Assumptions (A1), (A2), and (A3), we have

$$\|\hat\Theta - \Theta\|_{\max} = O_P\Big(\|\Theta\|_1 \sqrt{\frac{\log d}{n}}\Big), \tag{49}$$
$$\|\hat\Theta - \Theta\|_F = O_P\Big(\|\Theta\|_1 \sqrt{\frac{(s+d)\log d}{n}}\Big). \tag{50}$$

Proof. The proof of (49) can be found in Appendix D.2. To prove (50), since $s$ is the total
number of nonzero off-diagonal elements of $\Theta$, we have

$$\|\hat\Theta - \Theta\|_F \le \sqrt{s+d} \cdot \|\hat\Theta - \Theta\|_{\max} \le C \|\Theta\|_1 \sqrt{\frac{(s+d)\log d}{n}}, \tag{51}$$

where the second inequality follows from (49). □

Again, based on the results in Cai et al. (2011b) and Liu and Luo (2012), we know that
the TIGER achieves the minimax optimal rates of convergence under both elementwise
sup-norm and Frobenius norm on model class $\mathcal{M}(M_d, \xi_{\max}, k)$.

In summary, the TIGER simultaneously achieves the optimal rates of convergence for
precision matrix estimation under spectral norm, Frobenius norm, matrix $L_1$-norm, and
elementwise sup-norm.

4.2 Graph Recovery Consistency 

Next, we study the graph recovery property of the TIGER. Let $\hat\Theta$ be the estimated precision
matrix. Recall that we define the estimated graph $\hat G := (V, \hat E)$, where $(j, k) \in \hat E$ if and only
if $\hat\Theta_{jk} \neq 0$. Similarly, the true graph is $G := (V, E)$, where $(j, k) \in E$ if and only if
$\Theta_{jk} \neq 0$. We have the following theorem on graph recovery consistency.

Theorem 4 (Graph recovery consistency). We choose the regularization parameter $\lambda$ as in
(25) with $\zeta = 1$. We assume Assumptions (A1), (A2), and (A3) hold and that, for a sufficiently
large constant $K$, the smallest nonzero element of $\Theta$ satisfies

$$\theta_{\min} := \min_{(j,k) \in E} |\Theta_{jk}| \ge K \|\Theta\|_1 \sqrt{\frac{\log d}{n}}. \tag{52}$$

We have $\liminf_{n \to \infty} P\big(E \subset \hat E\big) = 1$.

Theoretically, it is possible that the TIGER delivers some precision matrix estimates 
with very small nonzero values. To achieve exact recovery, we can hard threshold the 
estimated precision matrix to get a sparse precision matrix estimate, as has been
discussed in Cai et al. (2011a). Let 



It can be seen that under the same conditions of Theorem 4, there exists a constant $A$ ($A$
may depend on $K$) such that the above hard threshold estimator achieves exact recovery.
More discussions can be found in Cai et al. (2011a). Unlike the CLIME and graphical
Dantzig selector, where the linear program solver may deliver very small nonzero values (but
not exact zeros), our algorithm defined in Section 3 is based on the soft-thresholding operator
and is capable of delivering exact zeros. Empirically, we found the TIGER procedure works
very effectively in graph estimation even without this hard-thresholding step. So this
hard-thresholding step is mainly of theoretical interest and is not necessary in applications.

It is worth pointing out that if we are only interested in estimating the graph, assumption
(A2) can be relaxed to $k \log d = o(n)$; see for example Meinshausen and Bühlmann (2006)
and Jalali et al. (2012). Such an extension is relatively straightforward and will be reported
elsewhere.

4.3 Discussion 

In this subsection, we briefly discuss the theoretical properties of the TIGER estimator and 
compare them with other existing results. 

The SCIO method proposed in Liu and Luo (2012) also provides the rates of convergence
for precision matrix estimation under various norms. One can see that the TIGER estimator
and SCIO estimator achieve the same rates of convergence in terms of spectral norm,
matrix $L_1$-norm, Frobenius norm, and elementwise sup-norm. However, here we consider
a much larger matrix class since we only bound the condition number of the precision
matrix while the SCIO bounds the largest and smallest eigenvalues from above and below,
respectively. Moreover, the SCIO requires the irrepresentable condition, which is stronger
than our condition and not required by the TIGER. In fact, it is still an open problem
whether the SCIO estimator can achieve the same rates as the TIGER on the model class
$\mathcal{M}(M_d, \xi_{\max}, k)$.

Note that the graphical Dantzig selector proposed in Yuan (2010) also considers the
model class where the largest and smallest eigenvalues are bounded from above and below.
Therefore the results of the graphical Dantzig selector are again on a smaller model class
than the TIGER estimator. Moreover, the graph recovery performance of the graphical
Dantzig selector is still an open problem.
When compared with the CLIME in Cai et al. (2011a), we see that the rate of convergence
of the TIGER is faster since the spectral norm rate of the CLIME is $O_P(kM \ldots)$
and the spectral norm rate of the TIGER is $O_P(kM \ldots)$.

When compared with the glasso method (see, for example, Rothman et al. (2008) and
Ravikumar et al. (2011)), the TIGER estimator achieves the minimax optimal rates of
convergence under spectral norm, Frobenius norm, matrix $L_1$-norm and elementwise
sup-norm. In contrast, the only theoretical result of the glasso is in terms of Frobenius norm;
it is still an open problem whether the glasso can achieve the same spectral norm rate
of convergence as the TIGER method.

The scaled-Lasso estimator proposed in Sun and Zhang (2012) provides the rates of
convergence under spectral norm and matrix $L_1$-norm of the precision matrix estimation.
However, the scaled-Lasso estimator considers a different model class where the smallest
eigenvalue of the correlation matrix is bounded away from zero by a constant. The optimality
of the obtained rates of convergence over that model class is not clear. Besides,
Sun and Zhang (2012) does not provide the elementwise sup-norm analysis and hence does
not have the graph recovery result. Though the TIGER has a close relationship with the
scaled-Lasso, the theoretical analysis of the current paper is dramatically different from
that in Sun and Zhang (2012).

Another related work is Meinshausen and Bühlmann (2006), which focuses only on
graph recovery and does not provide a precision matrix estimation result.

5 Experimental Results

We compare the numerical performance of the TIGER and other methods (glasso and
CLIME) in parameter estimation and graph recovery using simulated and real datasets.
The TIGER and CLIME algorithms are implemented in the R package bigmatrix, which
is publicly available through CRAN. The glasso is implemented in the R package huge (ver.
1.2.3).

5.1 Numerical Simulations

In our numerical simulations, we consider 6 settings to compare these methods: (i) n =
200, d = 100; (ii) n = 200, d = 200; (iii) n = 200, d = 400; (iv) n = 400, d = 100; (v)
n = 400, d = 200; (vi) n = 400, d = 400. We adopt the following models for generating
undirected graphs and precision matrices. A typical run of the generated graphs and the
heatmaps of the precision matrices are illustrated in Figure 1.

* Scale-free graph. The degree distribution of the scale-free graph follows a power law.
Figure 1: An illustration of the 6 graph patterns and the heatmaps of their corresponding
precision matrices adopted in the simulations. To ease visualization, we only present
example graphs with d = 200 nodes. [Panels: (d) Cluster, (e) Band, (f) Block; (j) Cluster, (k) Band, (l) Block]

The graph is generated by the preferential attachment mechanism. The graph begins with
an initial small chain graph of 2 nodes. New nodes are added to the graph one at a time.
Each new node is connected to one existing node with a probability that is proportional to
the degree that the existing node already has. Formally, the probability $p_i$ that
the new node is connected to an existing node $i$ is $p_i = k_i / \sum_j k_j$, where $k_i$ is the degree of node
$i$. The resulting graph has d edges (d = 200 or d = 400). Once the graph is obtained, we
generate an adjacency matrix $A$ by setting the nonzero off-diagonal elements to be 0.3 and
the diagonal elements to be 0. We calculate its smallest eigenvalue $\Lambda_{\min}(A)$. The precision
matrix is constructed as

$\Theta = D [A + (|\Lambda_{\min}(A)| + 0.2) \cdot I_d] D$, (53)

where $D \in \mathbb{R}^{d \times d}$ is a diagonal matrix with $D_{jj} = 1$ for $j = 1, \ldots, d/2$ and $D_{jj} = 3$ for
$j = d/2 + 1, \ldots, d$. The covariance matrix $\Sigma := \Theta^{-1}$ is then computed to generate the
multivariate normal data: $x_1, \ldots, x_n \sim N_d(0, \Sigma)$.

* Erdős-Rényi random graph. We add an edge between each pair of nodes with probability
0.02 independently. The resulting graph has approximately 400 edges when d = 200
and 1,596 edges when d = 400. Once the graph is obtained, we construct the adjacency
matrix $A$ and generate the precision matrix using (53) but setting $D_{jj} = 1$ for $j = 1, \ldots, d/2$
and $D_{jj} = 1.5$ for $j = d/2 + 1, \ldots, d$. We then invert $\Theta$ to get the covariance matrix $\Sigma$
and generate the multivariate normal data: $x_1, \ldots, x_n \sim N_d(0, \Sigma)$.

* Hub graph. The d nodes are evenly partitioned into d/20 disjoint groups, with each
group containing 20 nodes. Within each group, one node is selected as the hub and we add
edges between the hub and the other 19 nodes in that group. The resulting graph has 190
edges when d = 200 and 380 edges when d = 400. Once the graph is obtained, we generate
the precision and covariance matrices in the same way as the Erdős-Rényi random graph
model.

* Cluster graph. Similar to the hub model, the d nodes are evenly partitioned into d/20
disjoint groups, with each group containing 20 nodes. The subgraph of each group is an
Erdős-Rényi random graph with the probability parameter 0.2. The resulting graph has
approximately 380 edges when d = 200 and 760 edges when d = 400. Once the graph
is obtained, we generate the precision and covariance matrices in the same way as the
Erdős-Rényi random graph model.

* Band graph. Each node is assigned a coordinate $j$ with $j = 1, \ldots, d$. Two nodes are
connected by an edge whenever the corresponding coordinates are at distance less than or
equal to 3. The resulting graph has 594 edges when d = 200 and 1,194
edges when d = 400. Once the graph is obtained, we generate the precision and covariance
matrices in the same way as the Erdős-Rényi random graph model.

* Block graph. The precision matrix is a block diagonal matrix with block size d/20.
Each block has off-diagonal entries equal to 0.5 and diagonal entries equal to 1. Such
a matrix is guaranteed to be positive definite. The resulting matrix is then randomly
permuted by rows and columns. The resulting graph has approximately 900 edges when
d = 200 and 3,800 edges when d = 400. The covariance matrix $\Sigma := D^{-1} \Theta^{-1} D^{-1}$ is then
computed to generate multivariate normal data, where $D$ is a diagonal matrix with $D_{jj} = 1$
for $j = 1, \ldots, d/2$ and $D_{jj} = 1.5$ for $j = d/2 + 1, \ldots, d$.
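
The graph constructions above can be sketched in a few lines of Python. This is an illustrative reading of the descriptions, not the code used in the simulations: the function names are hypothetical, the hub is taken to be the first node of each group, d is assumed to be a multiple of 20, and the random row and column permutation of the block model is omitted because it does not change the edge count. Under this reading the preferential attachment construction yields a tree with d - 1 edges.

```python
import random


def scale_free_edges(d, rng):
    """Preferential attachment: each new node links to one existing node,
    chosen with probability proportional to that node's current degree."""
    edges = [(0, 1)]  # initial chain graph on 2 nodes
    degree = [1, 1]
    for new in range(2, d):
        target = rng.choices(range(new), weights=degree, k=1)[0]
        edges.append((target, new))
        degree[target] += 1
        degree.append(1)
    return edges


def hub_edges(d, group_size=20):
    """Within each group of consecutive nodes, the first node is the hub."""
    return [(g, g + i) for g in range(0, d, group_size) for i in range(1, group_size)]


def band_edges(d, width=3):
    """Connect nodes whose coordinates differ by at most `width`."""
    return [(j, k) for j in range(d) for k in range(j + 1, min(j + width, d - 1) + 1)]


def block_edges(d, n_blocks=20):
    """Every pair of nodes inside the same block of size d / n_blocks is connected."""
    size = d // n_blocks
    return [(b + i, b + j) for b in range(0, d, size)
            for i in range(size) for j in range(i + 1, size)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
counts = {
    "scale-free": len(scale_free_edges(200, rng)),
    "hub": len(hub_edges(200)),
    "band": len(band_edges(200)),
    "block": len(block_edges(200)),
}
```

For d = 200 the hub, band, and block counts computed this way are 190, 594, and 900, matching the edge counts quoted in the text.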

5.2 Graph Recovery Performance 

We first compare the TIGER with the CLIME and glasso on their graph recovery
performance. Let $G = (V, E)$ be a d-dimensional graph. We denote by $|E|$ the number of edges in
the graph $G$. We use the false positive and false negative rates to evaluate the graph
recovery performance. Let $\hat{G}^{\lambda} = (V, \hat{E}^{\lambda})$ be an estimated graph using a regularization parameter
$\lambda$ in any of these procedures. The number of false positives when using the regularization
parameter $\lambda$ is defined as FP($\lambda$) := the number of edges in $\hat{E}^{\lambda}$ but not in $E$. The number
of false negatives with $\lambda$ is defined as FN($\lambda$) := the number of edges in $E$ but not in $\hat{E}^{\lambda}$.
We further define the false negative rate (FNR) and false positive rate (FPR) as

$\mathrm{FNR}(\lambda) := \mathrm{FN}(\lambda)/|E|$ and $\mathrm{FPR}(\lambda) := \mathrm{FP}(\lambda)/\left[\binom{d}{2} - |E|\right]$. (54)

To illustrate the overall performance of the TIGER, CLIME and glasso methods over the
full paths, the receiver operating characteristic (ROC) curves are drawn using
$(\mathrm{FNR}(\lambda), 1 - \mathrm{FPR}(\lambda))$. For the TIGER method, we found that the graph estimates are the
same with or without the second hard-thresholding step. So, the presented result does not
use the second hard-thresholding step. The ROC curves for these models are presented in
Figures 2, 3, and 4.
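
The rates in (54) are straightforward to compute from two edge sets. The following is a minimal sketch; the function name `recovery_rates` and the toy edge sets are hypothetical, and the true graph is assumed to have at least one edge and at least one non-edge so that both denominators are nonzero.

```python
from math import comb


def recovery_rates(true_edges, est_edges, d):
    """Return (FNR, FPR) for an estimated edge set on d nodes, following (54)."""
    true_set = {tuple(sorted(e)) for e in true_edges}
    est_set = {tuple(sorted(e)) for e in est_edges}
    fp = len(est_set - true_set)  # edges estimated but not in the true graph
    fn = len(true_set - est_set)  # true edges that were missed
    fnr = fn / len(true_set)
    fpr = fp / (comb(d, 2) - len(true_set))
    return fnr, fpr


# Toy example on d = 4 nodes: one true edge is missed, one spurious edge is added.
fnr, fpr = recovery_rates({(0, 1), (1, 2)}, {(1, 0), (2, 3)}, d=4)
```

Here one of the two true edges is missed and one of the four non-edges is selected, so the sketch returns FNR = 0.5 and FPR = 0.25.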

From the ROC curves on the scale-free model in Figure 2, we see that the graph recovery
performance of the TIGER is significantly better than those of the CLIME and glasso in
higher dimensional settings (when d = 200 or d = 400 and d > n). This result suggests
that the TIGER is more adaptive to inhomogeneous noise models. From Figure 2, we also
see that the TIGER has graph recovery performance similar to that of the CLIME on the band
model. Both the TIGER and CLIME significantly outperform the glasso. In particular,
in the high dimensional setting when d = 400 and n = 200, the TIGER outperforms both
CLIME and glasso. This result suggests that the TIGER is more reliable when facing higher
dimensional problems on this model.

For the other models, from the ROC curves in Figures 3 and 4, we see that the three
methods perform similarly in these settings, and for Erdős-Rényi random graph models,
the TIGER outperforms the CLIME in the settings when n < d. This means that the
TIGER is effective for a wide range of models. In summary, the above numerical results
suggest that the TIGER is a very competitive graph estimation method in high dimensions.

Figure 2: ROC curves for the scale-free and band models (best visualized in color).

Figure 3: ROC curves for the cluster and Erdős-Rényi random graph models (best visualized
in color).

Figure 4: ROC curves for the hub and block models (best visualized in color).

5.3 Tuning-Insensitive Regularization Path 

We are interested in studying the tuning-insensitive property of the TIGER. For conciseness,
we only discuss the TIGER and CLIME in this section and compare their regularization
paths. Recall that

we see that $\zeta$ and $\lambda$ have a one-to-one mapping. In Figure 5 (a) and (b), we plot the curves
of Frobenius-norm errors vs. the tuning parameter $\zeta$ for the TIGER and CLIME. We define
FNR($\zeta$) and FPR($\zeta$) in the same way as in (54). We also define the graph recovery accuracy
as

$\mathrm{Accuracy}(\zeta) := 1 - \mathrm{FPR}(\zeta) - \mathrm{FNR}(\zeta)$. (56)

In Figure 5 (c) and (d), we plot the curves of the graph recovery accuracy vs. the tuning
parameter $\zeta$ for the TIGER and CLIME. The vertical axes of these plots are calibrated
so that the results are directly comparable. These plots illustrate the tuning-insensitive
property of the TIGER regularization path. For the TIGER, we found it is empirically
safe to only consider the regularization path over the range $\zeta \in [\sqrt{2}/\pi, 1]$. From both
subplots (a) and (c), the regularization paths are flat and do not change dramatically with
the change of $\zeta$ (in other words, the procedure is insensitive to the tuning parameter).
In contrast, for the CLIME, we need to search over a much larger range of $\zeta$ to find the
optimal value and the paths are more irregular. In the subplots (b) and (d), we visualize
the regularization paths of the CLIME over $\zeta \in [0.125, 2]$; these are only a fraction of
the whole regularization paths of the CLIME and the paths are irregular (or more sensitive
to the choice of $\zeta$). Therefore, it is much easier to choose a reasonable tuning parameter
for the TIGER than for the CLIME. In most cases, the choice of $\zeta = 1$ for the TIGER provides
a sparser graph estimate than the true graph. This is due to the fact that the asymptotic
analysis in the previous section involves many union bounds, which may be too conservative
in finite sample settings. As will be illustrated in the next subsection, we found that in
most settings, simply setting $\zeta = \sqrt{2}/\pi$ yields reasonably good graph and precision matrix
estimates. Such a choice is used as a default in the bigmatrix package on CRAN.
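
To make the notion of a flat path concrete, the sketch below evaluates the accuracy (56) along two paths and compares how much each varies. The function names `accuracy` and `path_spread` are hypothetical, and the (zeta, FPR, FNR) triples are toy numbers made up for illustration; they are not results from the paper.

```python
def accuracy(fpr, fnr):
    """Graph recovery accuracy as in (56): 1 - FPR - FNR."""
    return 1.0 - fpr - fnr


def path_spread(path):
    """Range (max minus min) of the accuracy along a regularization path."""
    values = [accuracy(fpr, fnr) for _, fpr, fnr in path]
    return max(values) - min(values)


# Toy (zeta, FPR, FNR) triples, made up for illustration only.
flat_path = [(0.45, 0.02, 0.10), (0.70, 0.01, 0.11), (1.00, 0.01, 0.12)]
irregular_path = [(0.125, 0.30, 0.02), (1.0, 0.02, 0.15), (2.0, 0.00, 0.60)]

flat_spread = path_spread(flat_path)
irregular_spread = path_spread(irregular_path)
```

A small spread means that any tuning value in the searched range gives nearly the same accuracy, which is the sense in which a path is tuning-insensitive.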
Figure 5: Comparison of the regularization paths of the TIGER and CLIME on the hub
graph model. The vertical axes of these plots (Frobenius norm error divided by the dimensionality d) are
calibrated so that the results are directly comparable. These plots illustrate the
tuning-insensitive property of the TIGER regularization path. For the TIGER, we only need to
consider the range $\zeta \in [\sqrt{2}/\pi, 1]$. From both subplots (a) and (c), the regularization paths
are flat and do not change dramatically with $\zeta$. In contrast, for the CLIME, we need to
search over a much larger range of $\zeta$ to find the optimal value and the paths are more
irregular (the subplots (b) and (d) only show a part of this range).

5.4 Quantitative Evaluation on Parameter Estimation Performance

We then compare the TIGER with CLIME and glasso on their parameter estimation
performance. For each model described before, we generate a training sample of size n. The tuning
parameters of the glasso and CLIME are automatically chosen in a data-dependent way.
More specifically, for a given sample size n, we generate the same number of independent
data points from the same distribution as a validation set. For each tuning parameter, the
glasso or CLIME estimated precision matrix is calculated on the training data. The optimal
tuning parameter is chosen by minimizing the held-out negative log-likelihood loss

$\mathcal{L}(\Theta) := \mathrm{tr}(\hat{\Sigma}\Theta) - \log|\Theta|$ (57)

on the validation set. For the TIGER, the tuning parameter $\zeta$ in (25) is simply chosen
to be $\zeta = \sqrt{2}/\pi$ so that the procedure is completely tuning-free. For dimensionality d =
100, 200, 400, we consider the spectral norm error $\|\hat{\Theta} - \Theta\|_2$ and Frobenius norm error
$\|\hat{\Theta} - \Theta\|_F$ of all the 6 models described in the previous subsections.
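
The held-out selection rule based on (57) can be sketched as follows. This is a minimal illustration in plain Python, assuming the matrix in the trace term is the sample covariance of the validation set; the function names `log_det_spd` and `held_out_loss` and the toy candidate matrices are hypothetical.

```python
import math


def log_det_spd(m):
    """Log-determinant of a symmetric positive definite matrix via Cholesky."""
    d = len(m)
    chol = [[0.0] * d for _ in range(d)]
    log_det = 0.0
    for i in range(d):
        for j in range(i + 1):
            s = m[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                chol[i][i] = math.sqrt(s)
                log_det += 2.0 * math.log(chol[i][i])
            else:
                chol[i][j] = s / chol[j][j]
    return log_det


def held_out_loss(s_val, theta):
    """tr(S * Theta) - log|Theta| for a validation covariance S and a candidate Theta."""
    d = len(theta)
    trace = sum(s_val[i][k] * theta[k][i] for i in range(d) for k in range(d))
    return trace - log_det_spd(theta)


# Toy validation covariance and two candidate precision matrices.
s_val = [[1.0, 0.0], [0.0, 1.0]]
candidates = {
    "identity": [[1.0, 0.0], [0.0, 1.0]],
    "rescaled": [[2.0, 0.0], [0.0, 0.5]],
}
best = min(candidates, key=lambda name: held_out_loss(s_val, candidates[name]))
```

With an identity validation covariance the loss is minimized at the identity precision matrix, so the sketch selects that candidate.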

The results are reported in Tables 1 and 2. In these tables we present the mean and
standard deviation (in parentheses) of the spectral and Frobenius norm errors based on
50 random trials. We see that in almost all cases, the TIGER and CLIME outperform the
glasso. In most cases, the TIGER outperforms the CLIME. Our results, though obtained
in different experimental settings, are consistent with the results in Sun and Zhang (2012)
for the scaled-Lasso based matrix inversion method.

One possible reason for the superior empirical performance of the TIGER over CLIME
and glasso is that the data-dependent tuning selection procedure described above
does not work well for the CLIME and glasso. To gain more insight, we also report
the oracle estimation results in Tables 3 and 4. In these tables we present the mean and
standard deviation (in parentheses) of the spectral and Frobenius norm errors based on
50 random trials. For all three methods, we draw the full regularization paths and select
the best tuning parameter by minimizing the corresponding spectral or Frobenius norm
errors to the true precision matrix. From these tables, we see that again the TIGER and
CLIME outperform the glasso and the TIGER is slightly better than the CLIME.

5.5 Gene Network 

This dataset includes 118 gene expression arrays from Arabidopsis thaliana that originally
appeared in Wille et al. (2004). Our analysis focuses on gene expression from 39 genes involved
in two isoprenoid metabolic pathways: 16 from the mevalonate (MVA) pathway are located
in the cytoplasm, 18 from the plastidial (MEP) pathway are located in the chloroplast,
and 5 are located in the mitochondria. While the two pathways generally operate independently,
crosstalk is known to happen (Wille et al., 2004). Our goal is to recover the gene
regulatory network, with special interest in crosstalk.

Table 1: Quantitative comparisons of the TIGER, glasso, and CLIME on the scale-free,
hub, and band models using the data-dependent model selection method.

                      Spectral norm                                                  Frobenius norm
Model       n    d    TIGER                CLIME                GLasso               TIGER                CLIME                GLasso
scale-free  200  100  [illegible]          [illegible]          [illegible]          [illegible]          [illegible]          [illegible]
                 200  [illegible]          [illegible]          [illegible]          [illegible]          [illegible]          [illegible]
                 400  [illegible]          [illegible]          [illegible]          [illegible]          [illegible]          [illegible]
            400  100  [illegible]          [illegible]          [illegible]          [illegible]          [illegible]          [illegible]
                 200  2.68762 (0.2668)     3.29772 (0.2480)     5.23455 (0.2602)     11.7521 (0.4706)     14.0718 (0.5345)     23.3348 (0.3524)
                 400  3.31452 (0.5758)     3.57110 (0.3068)     6.32043 (0.2675)     16.9996 (0.6411)     19.1326 (0.7619)     41.8133 (0.3717)
hub         200  100  2.67040 (0.4440)     4.31716 [illegible]  6.51814 (0.3070)     5.4347 (0.3356)      7.3545 [illegible]   10.6739 [illegible]
                 200  3.02307 (0.4151)     5.35713 (0.3096)     7.41337 (0.1374)     8.2277 (0.3049)      12.7079 (0.4182)     18.1376 (0.1653)
                 400  3.34315 (0.3017)     6.11506 (0.2093)     7.69167 (0.1061)     12.0676 (0.2801)     19.8038 (0.3058)     26.4038 (0.2002)
            400  100  1.82245 (0.2619)     2.46543 (0.3995)     5.28610 (0.1834)     3.7161 (0.1811)      4.5920 (0.2968)      8.5380 [illegible]
                 200  2.07601 (0.2105)     3.11260 (0.2491)     6.23933 (0.1208)     5.6517 (0.2366)      7.5627 (0.3261)      15.0739 (0.1747)
                 400  2.23719 (0.1675)     4.02006 (0.4557)     6.43531 (0.0913)     8.1420 (0.1207)      13.0960 (1.1531)     22.0039 (0.1230)
band        200  100  5.72715 [illegible]  4.37815 [illegible]  6.36789 (0.0850)     16.7205 [illegible]  12.8498 [illegible]  17.4244 [illegible]
                 200  6.04373 (0.1381)     5.81030 (0.1378)     7.32236 (0.0517)     24.6770 (0.2373)     23.3002 (0.3368)     28.8514 (0.1293)
                 400  6.28046 (0.0760)     6.76049 (0.0738)     7.87172 (0.1168)     36.0083 (0.1940)     38.6665 (0.1779)     44.7691 (0.7423)
            400  100  4.34163 (0.1847)     2.92088 (0.2552)     5.53820 (0.0871)     12.5012 (0.2989)     8.2567 (0.5046)      14.7940 (0.1483)
                 200  4.69878 (0.1302)     3.62779 (0.1494)     5.71676 (0.0740)     19.0534 (0.2633)     14.3844 (0.2863)     21.9130 (0.1285)
                 400  5.01406 (0.0709)     4.24658 (0.4150)     6.82527 (0.0392)     28.6523 (0.2313)     24.1215 (2.0819)     37.6074 (0.0953)

Table 2: Quantitative comparisons of the TIGER, glasso, and CLIME on the block,
Erdős-Rényi random, and cluster models using the data-dependent model selection method.

                      Spectral norm                                                  Frobenius norm
Model       n    d    TIGER                CLIME                GLasso               TIGER                CLIME                GLasso
block       200  100  3.88080 (0.2123)     2.92791 [illegible]  4.62777 [illegible]  12.7803 [illegible]  9.8588 [illegible]   14.1474 [illegible]
                 200  4.21258 (0.2073)     3.50847 (0.2014)     5.05219 (0.0408)     19.1984 (0.1841)     16.4653 (0.2043)     21.8517 (0.0455)
                 400  4.54196 (0.1456)     [illegible] (0.1344) [illegible] (0.0387) [illegible] (0.1471) [illegible] (0.1169) [illegible] (0.2657)
            400  100  2.61796 (0.1746)     1.91879 (0.2812)     3.63244 [illegible]  8.4651 (0.2232)      6.1005 (0.6898)      10.7808 [illegible]
                 200  2.86024 (0.1613)     2.04201 (0.1219)     4.26378 (0.0476)     12.8697 (0.2201)     9.7477 (0.2173)      17.7396 (0.0674)
                 400  3.12185 (0.1423)     3.03083 (0.1365)     4.70495 (0.0333)     19.4939 (0.2277)     19.6861 (0.1791)     28.0988 (0.0522)
random      200  100  1.40361 (0.2093)     1.63446 [illegible]  2.66316 [illegible]  4.9173 (0.2108)      5.5390 (0.2011)      7.4979 (0.5315)
                 200  1.92515 (0.1352)     2.13847 (0.1984)     2.97502 (0.2074)     9.3623 (0.1845)      10.4678 (0.5063)     13.2284 (0.8328)
                 400  3.03486 (0.0710)     3.57549 (0.0417)     4.16129 (0.0313)     17.6548 (0.1165)     20.1956 (0.1937)     23.3014 (0.0777)
            400  100  0.96871 (0.1265)     1.08246 [illegible]  1.94886 [illegible]  3.3962 (0.1332)      3.7713 (0.1521)      5.4841 (0.2022)
                 200  1.38675 (0.1133)     1.45379 (0.1517)     2.20816 (0.0634)     6.6106 (0.1273)      7.2891 (0.2580)      9.6509 (0.1012)
                 400  2.21101 (0.0792)     2.49335 (0.2862)     3.02710 (0.0468)     13.3298 (0.2506)     14.9634 (0.7566)     17.0508 (0.1373)
cluster     200  100  3.84966 (0.3371)     3.45717 [illegible]  5.35790 [illegible]  8.9219 [illegible]   8.3992 [illegible]   11.7727 [illegible]
                 200  3.66157 (0.2469)     3.90672 (0.3981)     5.11136 (0.2043)     11.6676 (0.2153)     12.1507 (0.7474)     16.0086 (0.5784)
                 400  2.99469 (0.1527)     3.34376 (0.1257)     4.11403 (0.0848)     15.1022 (0.1233)     16.5420 (0.1500)     20.2334 (0.1057)
            400  100  2.74935 (0.2604)     2.32058 (0.2737)     4.21725 (0.4733)     6.4102 (0.2067)      5.7983 (0.2324)      8.9115 (0.8694)
                 200  2.97759 (0.1958)     2.80625 (0.2788)     4.18294 (0.0969)     8.7524 (0.1863)      8.4317 (0.3450)      11.9685 (0.1067)
                 400  2.20812 (0.0627)     2.29522 (0.0675)     3.58522 (0.0630)     11.0885 (0.1962)     11.4516 (0.1888)     16.8512 (0.1091)

Table 3: Quantitative comparisons of the TIGER, glasso, and CLIME on the scale-free,
hub, and band models using the oracle model selection method.

                      Spectral norm                                                  Frobenius norm
Model       n    d    TIGER                CLIME                GLasso               TIGER                CLIME                GLasso
scale-free  200  100  3.49331 [illegible]  4.20880 [illegible]  4.44102 (0.2517)     11.4419 [illegible]  14.0362 [illegible]  14.6163 [illegible]
                 200  3.55648 (0.4300)     3.97078 (0.3312)     4.06802 (0.3903)     15.5481 (0.5613)     16.8466 (0.4808)     20.6089 (0.2790)
                 400  3.93523 (0.5407)     4.50636 (0.4290)     4.36077 (0.2883)     21.2401 (0.5478)     23.6391 (0.2663)     32.9509 (0.1983)
            400  100  2.65555 [illegible]  2.92806 (0.2093)     3.72758 (0.2702)     8.3378 [illegible]   10.1875 [illegible]  11.5142 (0.4212)
                 200  2.62826 (0.2368)     3.20446 (0.1861)     3.46773 (0.3240)     11.7350 (0.4554)     13.9074 (0.3791)     16.2752 (0.2982)
                 400  3.10365 (0.4002)     3.42918 (0.2338)     3.74972 (0.3319)     16.7337 (0.5986)     18.9504 (0.4950)     25.1616 (0.2071)
hub         200  100  2.42283 [illegible]  2.66693 (0.3362)     4.13856 (0.2862)     5.4124 (0.3213)      6.6481 (0.3566)      9.6412 (0.2294)
                 200  2.80843 (0.2737)     3.37883 (0.2853)     4.18505 (0.2116)     8.2170 (0.3008)      10.9757 (0.3744)     15.5200 (0.1934)
                 400  3.18571 (0.2904)     4.51681 (0.3600)     4.06350 (0.0675)     12.0676 (0.2801)     15.5895 (0.2402)     24.0977 (0.1782)
            400  100  1.63670 (0.1863)     1.70506 [illegible]  3.31738 (0.2704)     3.6970 [illegible]   4.2848 (0.2483)      7.6101 [illegible]
                 200  1.93008 (0.2119)     2.23970 (0.2799)     3.52276 (0.1672)     5.6341 (0.2342)      6.4934 (0.2451)      12.3937 (0.1873)
                 400  2.10478 (0.1478)     2.49148 (0.2029)     3.45920 (0.1311)     8.1420 (0.1207)      10.4606 (0.1850)     19.0323 (0.1332)
band        200  100  3.29816 [illegible]  3.27817 [illegible]  4.69050 (0.1342)     12.2957 [illegible]  11.7620 [illegible]  14.0160 [illegible]
                 200  3.89225 (0.2489)     4.53169 (0.2191)     4.78844 (0.1114)     19.7902 (0.3364)     20.0180 (0.3778)     22.8946 (0.2232)
                 400  4.72424 (0.1313)     5.57854 (0.1061)     4.77270 (0.0857)     31.1798 (0.2101)     34.4919 (0.3571)     37.4700 (0.1044)
            400  100  2.09320 (0.1604)     2.18392 (0.2292)     3.87918 (0.1348)     7.8187 (0.2798)      7.4255 (0.2845)      10.8361 (0.2176)
                 200  2.44876 (0.1452)     2.65924 (0.1878)     4.00698 (0.1037)     12.8840 (0.1946)     12.3307 (0.2244)     17.2196 (0.1598)
                 400  2.82127 (0.1166)     2.94520 (0.0979)     4.08170 (0.0655)     20.4888 (0.2645)     21.4755 (0.2665)     28.8954 (0.1768)

Table 4: Quantitative comparisons of the TIGER, glasso, and CLIME on the block,
Erdős-Rényi random, and cluster models using the oracle model selection method.

                      Spectral norm                                                  Frobenius norm
Model       n    d    TIGER                CLIME                GLasso               TIGER                CLIME                GLasso
block       200  100  1.78662 (0.1583)     1.96932 [illegible]  2.30284 [illegible]  7.0491 [illegible]   7.2065 [illegible]   8.7645 (0.1882)
                 200  2.26303 (0.1872)     2.58484 (0.1816)     2.72901 (0.0660)     11.8684 (0.2641)     13.6219 (0.2550)     16.0190 (0.1207)
                 400  2.79826 (0.2608)     [illegible] (0.2448) [illegible] (0.0702) [illegible] (0.2978) [illegible] (0.2781) 25.4660 (0.0911)
            400  100  1.18086 (0.1352)     1.22283 [illegible]  1.89098 [illegible]  4.5090 (0.1631)      4.2706 (0.1898)      6.2619 (0.1538)
                 200  1.33627 (0.1202)     1.48764 (0.1009)     1.94656 (0.1143)     7.4712 (0.1473)      7.9438 (0.1856)      11.0249 (0.1133)
                 400  1.58491 (0.1664)     2.04870 (0.1761)     2.58467 (0.0316)     11.8926 (0.1673)     15.1659 (0.1953)     19.7672 (0.1090)
random      200  100  1.30656 (0.1552)     1.54958 (0.1801)     1.93012 [illegible]  4.8291 [illegible]   5.4697 [illegible]   6.4292 (0.1705)
                 200  1.74892 (0.1026)     2.05243 (0.0972)     2.23740 (0.0769)     9.1014 (0.1862)      10.2764 (0.2062)     11.3447 (0.1505)
                 400  2.55517 (0.0663)     3.49539 (0.1640)     3.01732 (0.0501)     17.0088 (0.1298)     20.1956 (0.1937)     20.0869 (0.0937)
            400  100  0.87331 (0.0838)     1.03685 [illegible]  1.35098 [illegible]  3.3014 [illegible]   3.7713 [illegible]   4.8714 (0.1361)
                 200  1.19717 (0.0783)     1.37938 (0.0963)     1.79435 (0.0578)     6.2812 (0.1174)      7.1464 (0.1425)      8.7373 (0.0880)
                 400  1.67683 (0.1227)     2.11043 (0.0694)     2.45761 (0.0596)     12.3481 (0.2473)     14.0788 (0.3149)     15.6806 (0.1436)
cluster     200  100  2.50462 (0.2153)     2.61965 [illegible]  3.12584 [illegible]  7.9860 [illegible]   8.1190 [illegible]   9.9439 (0.1874)
                 200  2.72033 (0.2079)     2.89910 (0.2528)     2.91801 (0.1019)     11.1472 (0.2200)     11.5100 (0.2111)     13.2443 (0.1515)
                 400  2.49864 (0.1125)     3.30405 (0.1874)     2.94846 (0.1043)     14.8704 (0.2025)     16.5420 (0.1500)     18.3397 (0.1082)
            400  100  1.63625 (0.1633)     1.74909 (0.1612)     2.47721 (0.2247)     5.3929 (0.2439)      5.4465 (0.2127)      7.1791 (0.1576)
                 200  1.84653 (0.1555)     2.04163 (0.1747)     2.29099 (0.1570)     7.7920 (0.1718)      7.7961 (0.1957)      10.4819 (0.1135)
                 400  1.66908 (0.1359)     1.75361 (0.0699)     2.46020 (0.0658)     10.3907 (0.1888)     10.8631 (0.1774)     14.6437 (0.1132)




Figure 6: The histogram and normal QQ plots of the marginal expression levels of the gene
MECPS: (a) histogram (axis label: Expression Value); (b) normal QQ-plot (axis label:
Theoretical Quantile). We see the data are not exactly Gaussian distributed.

We first examine whether the data actually satisfy the Gaussian distribution assumption.
In Figure 6 we plot the histogram and normal QQ plot of the expression levels of a
gene named MECPS. From the histogram, we see the distribution is left-skewed compared
to the Gaussian distribution. From the normal QQ plot, we see the empirical distribution
has a heavier tail compared to the Gaussian. To suitably apply the TIGER method to this
dataset, we need to first transform the data so that its distribution is closer to Gaussian.
Therefore, we Gaussianize the marginal expression values of each gene by converting them
to the corresponding normal scores. This is automatically done by the huge.npn function
in the R package huge (Zhao et al., 2012).
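
The idea of converting marginal values to normal scores can be sketched as below. This is an assumption-laden illustration and not the huge.npn implementation: it maps the rank r of each observation to the standard normal quantile at r / (n + 1), breaks ties by order of appearance, and the function name `normal_scores` is hypothetical.

```python
from statistics import NormalDist


def normal_scores(values):
    """Replace each value by the standard normal quantile of its rank / (n + 1)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    std_normal = NormalDist()
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = std_normal.inv_cdf(rank / (n + 1))
    return scores


# A skewed toy sample: the transform keeps the ordering but symmetrizes the margins.
expression = [0.1, 0.2, 0.4, 0.9, 5.0]
scores = normal_scores(expression)
```

The transform is monotone, so it preserves the ordering of the observations while making each marginal distribution approximately standard normal.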

We apply the TIGER on the transformed data using the default tuning parameter
$\zeta = \sqrt{2}/\pi$. The estimated network is shown in Figure 7. We see the estimated network
is very sparse with only 44 edges. We draw the within-pathway connections using solid
lines and the between-pathway connections using dashed lines. Our result is consistent
with previous investigations, which suggest that the connections from genes AACT1 and
HMGR2 to gene MECPS indicate a primary source of the crosstalk between the MEP and
MVA pathways; these edges are present in the estimated network. MECPS is clearly
a hub gene for this pathway.

For the MEP pathway, the genes DXPS2, DXR, MCT, CMK, HDR, and MECPS are
connected as in the true metabolic pathway. Similarly, for the MVA pathway, the genes
AACT2, HMGR2, MK, MPDC1, MPDC2, FPPS1 and FPP2 are closely connected. Our
analysis suggests 11 cross-pathway links, which is consistent with the previous investigation in
Wille et al. (2004). This result suggests that there might exist rich inter-pathway crosstalk.

Figure 7: The estimated gene networks of the Arabidopsis dataset. The within-pathway
links are denoted by solid lines and between-pathway links are denoted by dashed lines.

6 Conclusions 

We introduce a tuning-insensitive approach named TIGER for estimating high dimensional
Gaussian graphical models. Our method is asymptotically tuning-free and simultaneously
achieves the minimax optimal rates of convergence in precision matrix estimation under
different matrix norms (matrix $L_1$, spectral, Frobenius, and elementwise sup-norm).
Computationally, our procedure is significantly faster than existing methods due to its
tuning-insensitive property. The advantages of our estimators are also illustrated using both
simulated and real data examples. The TIGER approach is a very competitive alternative for
estimating high dimensional Gaussian graphical models.

There are several possible directions to expand the current methods. First, it is
interesting to extend the TIGER to the nonparanormal setting for estimating high dimensional
Gaussian copula graphical models (Liu et al., 2012). Second, it is also interesting to extend
the TIGER to more complex settings where latent variables or missing data might exist.

A Appendix: Proofs

We first prove several important lemmas, followed by the proof of the main theorems. For 
notational simplicity, we use a generic constant C whose value may change from line to 
line. 

A.1 Preliminaries

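For any set $S \subset \{1, 2, \dots, d\}$ with $|S| \le k$, let $\mathcal{A}_d^{c}(S)$ denote a subset of $\mathbb{R}^d$ defined as

\[ \mathcal{A}_d^{c}(S) := \{\beta \in \mathbb{R}^d : \|\beta_{S^c}\|_1 \le c\|\beta_S\|_1,\ \beta \neq 0\}. \tag{58} \]

Also, we define

\[ \mathcal{A}_d^{c}(k) := \bigcup_{S \subset \{1,2,\dots,d\},\ |S| \le k} \mathcal{A}_d^{c}(S). \tag{59} \]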



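Recall that the matrix class $\mathcal{M}(\xi_{\max}, k)$ is defined as

\[ \mathcal{M}(\xi_{\max}, k) := \Big\{ \Theta = \Theta^T \in \mathbb{R}^{d\times d} : \Theta \succ 0,\ \frac{\Lambda_{\max}(\Theta)}{\Lambda_{\min}(\Theta)} \le \xi_{\max},\ \max_{j} \sum_{k} I(\Theta_{jk} \neq 0) \le k \Big\}. \]

Let $\Theta := \Sigma^{-1} \in \mathcal{M}(\xi_{\max}, k)$; we have

\[ \frac{\Lambda_{\max}(\Theta)}{\Lambda_{\min}(\Theta)} = \frac{\Lambda_{\max}(\Sigma^{-1})}{\Lambda_{\min}(\Sigma^{-1})} = \frac{\Lambda_{\max}(\Sigma)}{\Lambda_{\min}(\Sigma)} \le \xi_{\max}. \tag{60} \]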
We define the population correlation matrix $R$ as

\[ R := [\mathrm{diag}(\Sigma)]^{-1/2}\, \Sigma\, [\mathrm{diag}(\Sigma)]^{-1/2}. \tag{61} \]

We also define $\tau_j := \sigma_j \hat\Gamma_{jj}^{-1/2}$. We recall that there are three assumptions:
(A1) $\Sigma^{-1} \in \mathcal{M}(\xi_{\max}, k)$,

(A2) $k^2 \log d = o(n)$,

. _, 9 log d 1 
(A3) limsup^^maxKjXdS ■ < -. 

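We define

\[ Q_j(\beta_j) := \|Z_{*j} - Z_{*\setminus j}\beta_j\|_2. \tag{62} \]

Therefore, $\hat\beta_j$ from (27) can be written as

\[ \hat\beta_j = \arg\min_{\beta_j} \Big\{ \frac{1}{\sqrt{n}} Q_j(\beta_j) + \lambda\|\beta_j\|_1 \Big\}, \quad \text{for } j = 1, \dots, d. \tag{63} \]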
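As a small illustration of (62) and (63), the following sketch evaluates the SQRT-Lasso objective for a given coefficient vector. It is not the authors' implementation: the function name `sqrt_lasso_objective`, the simulated data, and the value `lam=0.1` are assumptions made purely for the example.

```python
import numpy as np

def sqrt_lasso_objective(z_j, z_rest, beta, lam):
    """Evaluate (1/sqrt(n)) * ||z_j - z_rest @ beta||_2 + lam * ||beta||_1, as in (63)."""
    n = z_j.shape[0]
    q_j = np.linalg.norm(z_j - z_rest @ beta)  # Q_j(beta) from (62)
    return q_j / np.sqrt(n) + lam * np.sum(np.abs(beta))

# Hypothetical data: n = 50 samples, d = 5 variables; column 0 plays the role of Z_{*j}.
rng = np.random.default_rng(0)
n, d = 50, 5
z = rng.standard_normal((n, d))
z_j, z_rest = z[:, 0], z[:, 1:]

# At beta = 0 the penalty vanishes and the objective is ||z_j||_2 / sqrt(n).
value_at_zero = sqrt_lasso_objective(z_j, z_rest, np.zeros(d - 1), lam=0.1)
```

Because the loss is the unsquared residual norm, the objective scales linearly with the noise level, which is what makes a noise-free choice of $\lambda$ possible.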

Throughout the proof, for notational simplicity, we always take the tuning parameter $\lambda$
to be

\[ \lambda = c\sqrt{\frac{2a\log d}{n}}, \tag{64} \]



where $c > 1$ and $a > 2$. It is easy to see that $\lambda = \frac{\pi}{\sqrt{2}}\sqrt{\frac{\log d}{n}}$ is a special case of this setup.

A.2 Technical Lemmas

Throughout this paper we often use one of the following tail bounds for central $\chi^2$ random
variables. We denote by $\chi^2_d$ a chi-square variable with $d$ degrees of freedom. Lemma 5
presents some well known results on $\chi^2_d$; the proofs can be found in the original papers.

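Lemma 5 (Johnstone (2000) and Laurent and Massart (1998)). Let $X \sim \chi^2_d$. We have

\[ \max\Big\{ \mathbb{P}\big(X - d > 2\sqrt{dt} + 2t\big),\ \mathbb{P}\big(X - d < -2\sqrt{dt}\big) \Big\} \le \exp(-t) \quad \text{for all } t \ge 0, \tag{65} \]

\[ \mathbb{P}\big(|X - d| \ge dt\big) \le \exp\Big(-\frac{3}{16}dt^2\Big) \quad \text{for all } t \in \Big[0, \frac{1}{2}\Big), \tag{66} \]

\[ \mathbb{P}\big(X \le (1-t)d\big) \le \exp\Big(-\frac{1}{4}dt^2\Big) \quad \text{for all } t \in \Big[0, \frac{1}{2}\Big). \tag{67} \]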
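The upper-tail bound in (65) can be checked numerically. The sketch below is only a Monte Carlo sanity check under assumed values of $d$ and $t$ (the helper name `chi2_upper_tail_check` is hypothetical); it compares the empirical frequency of the event $X - d > 2\sqrt{dt} + 2t$ with $\exp(-t)$.

```python
import numpy as np

def chi2_upper_tail_check(d, t, n_samples=200_000, seed=0):
    """Return (empirical tail frequency, exp(-t)) for the upper-tail bound in (65)."""
    rng = np.random.default_rng(seed)
    x = rng.chisquare(d, size=n_samples)             # draws of a chi-square with d degrees of freedom
    threshold = d + 2.0 * np.sqrt(d * t) + 2.0 * t   # d + 2*sqrt(d*t) + 2*t
    empirical = float(np.mean(x > threshold))
    bound = float(np.exp(-t))
    return empirical, bound

empirical, bound = chi2_upper_tail_check(d=10, t=2.0)
```

The bound is not tight for moderate $d$ and $t$, so the empirical frequency is expected to sit well below $\exp(-t)$.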



The next lemma bounds the tail of sample correlation for bivariate Gaussian random 
variables. 
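Lemma 6. Let $X := (X_1, X_2)^T$ follow a bivariate normal distribution:

\[ \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N_2\left( 0, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right). \tag{68} \]

Let $x_1, \dots, x_n \in \mathbb{R}^2$ be $n$ independent data points from $X$ and $\hat\Sigma := \frac{1}{n}\sum_{i=1}^n x_i x_i^T$ be the
sample covariance matrix. We define the sample and population correlations $\hat\rho$ and $\rho$ as

\[ \hat\rho := (\hat\Sigma_{11})^{-1/2}\hat\Sigma_{12}(\hat\Sigma_{22})^{-1/2} \quad \text{and} \quad \rho := (\Sigma_{11})^{-1/2}\Sigma_{12}(\Sigma_{22})^{-1/2}. \tag{69} \]

For any t ∈ [0, ^-^), we have

\[ \mathbb{P}\big(|\hat\rho - \rho| > t\big) \le 4\exp\Big[-\frac{3nt^2}{64(1+|\rho|)^2}\Big]. \tag{70} \]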

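Proof. Suppose that $r$ is a small positive number such that $|\rho| r < t(1-r)$. We have

\begin{align}
\mathbb{P}\big(|\hat\rho - \rho| > t\big) \tag{71} \\
&= \mathbb{P}\big(|(\hat\Sigma_{11})^{-1/2}\hat\Sigma_{12}(\hat\Sigma_{22})^{-1/2} - \rho| > t\big) \tag{72} \\
&= \mathbb{P}\big(|\hat\Sigma_{12} - \rho(\hat\Sigma_{11}\hat\Sigma_{22})^{1/2}| > t(\hat\Sigma_{11}\hat\Sigma_{22})^{1/2}\big). \tag{73}
\end{align}

Let $r := \frac{t}{1 + 2|\rho| + t}$; we define the event

\[ \mathcal{A} := \big\{ |\hat\Sigma_{11} - \Sigma_{11}| \le r\Sigma_{11} \ \text{and}\ |\hat\Sigma_{22} - \Sigma_{22}| \le r\Sigma_{22} \big\}. \tag{74} \]

On $\mathcal{A}$, we have

\[ \big|(\hat\Sigma_{11}\hat\Sigma_{22})^{1/2} - (\Sigma_{11}\Sigma_{22})^{1/2}\big| \le r(\Sigma_{11}\Sigma_{22})^{1/2}. \tag{75} \]

Therefore, using (75), we have

\begin{align}
\mathbb{P}\big(|\hat\rho - \rho| > t\big) \tag{76} \\
&\le \mathbb{P}\big(|\hat\Sigma_{12} - \rho(\hat\Sigma_{11}\hat\Sigma_{22})^{1/2}| > t(\hat\Sigma_{11}\hat\Sigma_{22})^{1/2},\ \mathcal{A}\big) + \mathbb{P}(\mathcal{A}^c) \tag{77} \\
&\le \mathbb{P}\big(|\hat\Sigma_{12} - \rho(\Sigma_{11}\Sigma_{22})^{1/2}| > [t(1-r) - |\rho| r](\Sigma_{11}\Sigma_{22})^{1/2}\big) + \mathbb{P}(\mathcal{A}^c) \tag{78} \\
&= \mathbb{P}\big(|\hat\Sigma_{12} - \Sigma_{12}| > [t(1-r) - |\rho| r](\Sigma_{11}\Sigma_{22})^{1/2}\big) + \mathbb{P}(\mathcal{A}^c) \tag{79} \\
&\le \mathbb{P}\big(|\hat\Sigma_{12} - \Sigma_{12}| > [t(1-r) - |\rho| r](\Sigma_{11}\Sigma_{22})^{1/2}\big) + \mathbb{P}\big(|\hat\Sigma_{11} - \Sigma_{11}| > r\Sigma_{11}\big) + \mathbb{P}\big(|\hat\Sigma_{22} - \Sigma_{22}| > r\Sigma_{22}\big). \tag{80}
\end{align}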
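Therefore, it suffices to analyze the terms $\mathbb{P}(|\hat\Sigma_{12} - \Sigma_{12}| > \epsilon)$ and $\mathbb{P}(|\hat\Sigma_{11} - \Sigma_{11}| > \epsilon)$.
For this, we denote $x_i := (x_{i1}, x_{i2})^T$ and define

\[ \tilde x_{i1} := \frac{x_{i1}}{\sqrt{\Sigma_{11}}} \sim N(0, 1) \quad \text{and} \quad \tilde x_{i2} := \frac{x_{i2}}{\sqrt{\Sigma_{22}}} \sim N(0, 1). \]

We have

\[ \frac{(\tilde x_{i1} + \tilde x_{i2})^2}{2(1+\rho)} \sim \chi^2_1 \quad \text{and} \quad \frac{(\tilde x_{i1} - \tilde x_{i2})^2}{2(1-\rho)} \sim \chi^2_1. \]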

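Furthermore, since $\sum_{i=1}^n \tilde x_{i1}^2 \sim \chi^2_n$, we have

\[ \mathbb{P}\big(|\hat\Sigma_{11} - \Sigma_{11}| > \epsilon\big) = \mathbb{P}\Big(\Big|\frac{1}{n}\sum_{i=1}^n \tilde x_{i1}^2 - 1\Big| > \frac{\epsilon}{\Sigma_{11}}\Big) \le \exp\Big(-\frac{3n\epsilon^2}{16(\Sigma_{11})^2}\Big), \tag{85} \]

where the last inequality follows from (67).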
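Similarly,

\begin{align*}
\mathbb{P}\big(|\hat\Sigma_{12} - \Sigma_{12}| > \epsilon\big)
&= \mathbb{P}\Big(\Big|\frac{1}{n}\sum_{i=1}^n \tilde x_{i1}\tilde x_{i2} - \rho\Big| > \frac{\epsilon}{(\Sigma_{11}\Sigma_{22})^{1/2}}\Big) \\
&= \mathbb{P}\Big(\Big|\frac{1}{n}\sum_{i=1}^n \big[(\tilde x_{i1} + \tilde x_{i2})^2 - 2(1+\rho) - (\tilde x_{i1} - \tilde x_{i2})^2 + 2(1-\rho)\big]\Big| > \frac{4\epsilon}{(\Sigma_{11}\Sigma_{22})^{1/2}}\Big) \\
&\le \mathbb{P}\Big(\Big|\frac{1}{n}\sum_{i=1}^n \big[(\tilde x_{i1} + \tilde x_{i2})^2 - 2(1+\rho)\big]\Big| > \frac{2\epsilon}{(\Sigma_{11}\Sigma_{22})^{1/2}}\Big) \\
&\quad + \mathbb{P}\Big(\Big|\frac{1}{n}\sum_{i=1}^n \big[(\tilde x_{i1} - \tilde x_{i2})^2 - 2(1-\rho)\big]\Big| > \frac{2\epsilon}{(\Sigma_{11}\Sigma_{22})^{1/2}}\Big) \\
&= \mathbb{P}\Big(\Big|\frac{1}{n}\sum_{i=1}^n \Big[\frac{(\tilde x_{i1} + \tilde x_{i2})^2}{2(1+\rho)} - 1\Big]\Big| > \frac{\epsilon}{(1+\rho)(\Sigma_{11}\Sigma_{22})^{1/2}}\Big) \tag{91} \\
&\quad + \mathbb{P}\Big(\Big|\frac{1}{n}\sum_{i=1}^n \Big[\frac{(\tilde x_{i1} - \tilde x_{i2})^2}{2(1-\rho)} - 1\Big]\Big| > \frac{\epsilon}{(1-\rho)(\Sigma_{11}\Sigma_{22})^{1/2}}\Big). \tag{92}
\end{align*}

Applying (67) on (91) and (92), we have

\[ \mathbb{P}\big(|\hat\Sigma_{12} - \Sigma_{12}| > \epsilon\big) \le \exp\Big(-\frac{3n\epsilon^2}{16(1+\rho)^2\Sigma_{11}\Sigma_{22}}\Big) + \exp\Big(-\frac{3n\epsilon^2}{16(1-\rho)^2\Sigma_{11}\Sigma_{22}}\Big). \tag{93} \]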
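Combining (80) with (85) and (93), we get

\[ \mathbb{P}\big(|\hat\rho - \rho| > t\big) \le 2\exp\Big(-\frac{3n[t(1-r) - |\rho| r]^2}{16(1+|\rho|)^2}\Big) + 2\exp\Big(-\frac{3nr^2}{16}\Big). \tag{94} \]

Since

\[ r := \frac{t}{1 + 2|\rho| + t}, \]

we have

\[ \mathbb{P}\big(|\hat\rho - \rho| > t\big) \le 4\exp\Big(-\frac{3nt^2}{64(1+|\rho|)^2}\Big). \tag{95} \]

We thus complete the whole proof. □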

The following lemma bounds the difference between the sample correlation matrix and
the true correlation matrix in elementwise sup-norm.

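Lemma 7. Let $\hat R$ and $R$ be the sample and population correlation matrices. We define the
event

\[ \mathcal{A}_1 := \Big\{ \|\hat R - R\|_{\max} \le 18\sqrt{\frac{\log d}{n}} \Big\}. \tag{96} \]

Then, we have $\mathbb{P}(\mathcal{A}_1) \ge 1 - 1/d$.

Proof. From Lemma 6, we have

\[ \mathbb{P}\big(\|\hat R - R\|_{\max} > t\big) \le 4d^2\exp\Big(-\frac{3nt^2}{256}\Big). \tag{97} \]

The result follows by choosing $t = 18\sqrt{\frac{\log d}{n}}$. □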
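The event in (96) can be illustrated by simulation. The following is a rough sketch, not part of the original analysis: it assumes an AR(1) population correlation matrix with unit variances (an arbitrary choice), and the function name `correlation_sup_norm_error` is hypothetical. It returns the elementwise sup-norm error of the sample correlation matrix together with the threshold $18\sqrt{\log d / n}$.

```python
import numpy as np

def correlation_sup_norm_error(n, d, rho=0.5, seed=0):
    """Return (||R_hat - R||_max, 18*sqrt(log(d)/n)) for simulated mean-zero Gaussian data."""
    rng = np.random.default_rng(seed)
    idx = np.arange(d)
    R = rho ** np.abs(idx[:, None] - idx[None, :])   # assumed AR(1) correlation, unit variances
    X = rng.multivariate_normal(np.zeros(d), R, size=n)
    S = X.T @ X / n                                  # sample covariance for mean-zero data
    s = np.sqrt(np.diag(S))
    R_hat = S / np.outer(s, s)                       # sample correlation matrix
    err = float(np.max(np.abs(R_hat - R)))
    threshold = float(18.0 * np.sqrt(np.log(d) / n))
    return err, threshold

err, threshold = correlation_sup_norm_error(n=500, d=20)
```

In runs of this kind the observed error is typically far below the threshold, which reflects the conservative constant 18 in (96).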

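Lemma 8. Let $\epsilon_j := (\epsilon_{j1}, \dots, \epsilon_{jn})^T \in \mathbb{R}^n$ and $\epsilon_j \sim N_n(0, \sigma_j^2 I_n)$. We define the event

\[ \mathcal{A}_2 := \Big\{ \max_{1\le j\le d} \frac{\|\epsilon_j\|_2^2}{n\sigma_j^2} \le 1.4 \ \text{and}\ \max_{1\le j\le d} \Big| \frac{\|\epsilon_j\|_2^2}{n\sigma_j^2} - 1 \Big| \le 3.5\sqrt{\frac{\log d}{n}} \Big\}. \tag{98} \]

Then $\mathbb{P}(\mathcal{A}_2) \ge 1 - d\exp\big(-\frac{n}{100}\big) - \frac{1}{d}$.

Proof. Since

\[ \frac{\|\epsilon_j\|_2^2}{\sigma_j^2} = \sum_{i=1}^n \frac{\epsilon_{ji}^2}{\sigma_j^2} \sim \chi^2_n, \tag{99} \]

by Lemma 5, it is easy to see that for any constant $1 < w < 1.5$,

\[ \mathbb{P}\Big(\frac{\|\epsilon_j\|_2^2}{n\sigma_j^2} > w\Big) \le \exp\Big(-\frac{3}{16}n(w-1)^2\Big) \tag{100} \]

for all $j = 1, 2, \dots, d$. By setting $w = 1.4$, we have

\[ \mathbb{P}\Big(\max_{1\le j\le d} \frac{\|\epsilon_j\|_2^2}{n\sigma_j^2} \le 1.4\Big) \ge 1 - d\exp\Big(-\frac{n}{100}\Big). \tag{101} \]

Similarly, for $t \in [0, 1/2)$, we have

\[ \mathbb{P}\Big(\Big|\frac{\|\epsilon_j\|_2^2}{n} - \sigma_j^2\Big| \ge t\sigma_j^2\Big) \le \exp\Big(-\frac{3}{16}nt^2\Big). \tag{102} \]

By setting $t = 3.5\sqrt{\frac{\log d}{n}}$, we have

\[ \mathbb{P}\Big(\max_{1\le j\le d} \Big|\frac{\|\epsilon_j\|_2^2}{n\sigma_j^2} - 1\Big| \le 3.5\sqrt{\frac{\log d}{n}}\Big) \ge 1 - \frac{1}{d}. \tag{103} \]

The desired result of this lemma follows from a union bound of (101) and (103). □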
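The two conditions defining the event in (98) are easy to examine on simulated noise. The sketch below is illustrative only; the noise level `sigma=2.0`, the sizes `n=2000` and `d=50`, and the helper name `noise_norm_ratios` are assumptions chosen for the example.

```python
import numpy as np

def noise_norm_ratios(n, d, sigma=2.0, seed=0):
    """Return ||eps_j||_2^2 / (n * sigma^2) for d independent N_n(0, sigma^2 I_n) vectors."""
    rng = np.random.default_rng(seed)
    eps = sigma * rng.standard_normal((n, d))   # column j plays the role of eps_j
    return np.sum(eps ** 2, axis=0) / (n * sigma ** 2)

n, d = 2000, 50
ratios = noise_norm_ratios(n, d)
max_ratio = float(ratios.max())                         # compare with 1.4
max_dev = float(np.abs(ratios - 1.0).max())             # compare with 3.5*sqrt(log(d)/n)
dev_threshold = float(3.5 * np.sqrt(np.log(d) / n))
```

Each ratio concentrates around 1 at rate $\sqrt{2/n}$, so both conditions in (98) hold with a comfortable margin at these sizes.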



The next lemma bounds the sample standard deviation of each marginal univariate 
Gaussian random variable. 



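Lemma 9. Let $\hat\Sigma$ be the sample covariance matrix. Under the assumption that

\[ \limsup_{n\to\infty} \max_{1\le j\le d} \Sigma_{jj}\sqrt{\frac{\log d}{n}} < \frac{1}{7}, \tag{104} \]

we define the event

\[ \mathcal{A}_3 := \Big\{ \frac{1}{2}\Lambda_{\min}(\Sigma) \le \min_{1\le j\le d} \hat\Sigma_{jj} \le \max_{1\le j\le d} \hat\Sigma_{jj} \le \frac{3}{2}\Lambda_{\max}(\Sigma) \Big\}. \tag{105} \]

We have $\mathbb{P}(\mathcal{A}_3) \ge 1 - 1/d$.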

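Proof. By the definitions of $\Lambda_{\max}(\Sigma)$ and $\Lambda_{\min}(\Sigma)$, we have

\[ \Lambda_{\max}(\Sigma) \ge \max_{1\le j\le d} \Sigma_{jj} \ge \min_{1\le j\le d} \Sigma_{jj} \ge \Lambda_{\min}(\Sigma). \tag{106} \]

By (85), we know that for any $j \in \{1, \dots, d\}$ and $0 < \epsilon < 1/2$,

\[ \mathbb{P}\big(|\hat\Sigma_{jj} - \Sigma_{jj}| > \epsilon\big) \le \exp\Big(-\frac{3n\epsilon^2}{16(\Sigma_{jj})^2}\Big). \tag{107} \]

We choose $\epsilon = t\Sigma_{jj}$, then

\[ \mathbb{P}\big(\Sigma_{jj}(1-t) \le \hat\Sigma_{jj} \le (1+t)\Sigma_{jj}\big) \ge 1 - \exp\Big(-\frac{3nt^2}{16}\Big). \tag{108} \]

By the union bound, we have

\[ \mathbb{P}\Big((1-t)\Lambda_{\min}(\Sigma) \le \min_{1\le j\le d} \hat\Sigma_{jj} \le \max_{1\le j\le d} \hat\Sigma_{jj} \le (1+t)\Lambda_{\max}(\Sigma)\Big) \ge 1 - d\exp\Big(-\frac{3nt^2}{16}\Big). \tag{109} \]

Under the assumption (104), we know that, for large enough $n$, we must have $\epsilon = t\Sigma_{jj} < 1/2$.
The desired result of the lemma follows by setting $t = 3.5\sqrt{\frac{\log d}{n}}$. □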

V n 

Recall that we define

\[ Q_j(\beta_j) := \|Z_{*j} - Z_{*\setminus j}\beta_j\|_2. \tag{110} \]

The next lemma provides theoretical justification for the choice of the tuning parameter $\lambda$.



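Lemma 10. Let $\lambda = c\sqrt{\frac{2a\log d}{n}}$ with $c > 1$ and $a > 2$. We define an event

\[ \mathcal{A}_4 := \Big\{ \max_{1\le j\le d} c\|\nabla Q_j(\beta_j)\|_\infty \le \lambda\sqrt{n} \Big\}. \tag{111} \]

Then

\[ \mathbb{P}(\mathcal{A}_4) \ge 1 - \sqrt{\frac{2}{\pi a\log d}}\cdot d^{\,2 - a\left(1 - 2\sqrt{\frac{(a-1)\log d}{n}}\right)} - \frac{1}{d^{a-2}}. \tag{112} \]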

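Proof. Using the fact that $Z_{*j} = X_{*j}\hat\Gamma_{jj}^{-1/2}$ and $Z_{*\setminus j} = X_{*\setminus j}\hat\Gamma_{\setminus j,\setminus j}^{-1/2}$, we have

\[ Z_{*j} = Z_{*\setminus j}\beta_j + \hat\Gamma_{jj}^{-1/2}\epsilon_j, \tag{113} \]

where $\epsilon_j \sim N_n(0, \sigma_j^2 I_n)$. We then have

\begin{align}
\|\nabla Q_j(\beta_j)\|_\infty &= \frac{\|Z_{*\setminus j}^T(Z_{*j} - Z_{*\setminus j}\beta_j)\|_\infty}{\|Z_{*j} - Z_{*\setminus j}\beta_j\|_2} \tag{114} \\
&= \frac{\|Z_{*\setminus j}^T\epsilon_j\|_\infty}{\|\epsilon_j\|_2}. \tag{115}
\end{align}

From the properties of the multivariate Gaussian, we know that $Z_{*\setminus j}$ and $\epsilon_j$ are independent.
For any $\ell \neq j$, $Z_{*\ell}^T\epsilon_j$ follows a $N(0, n\sigma_j^2)$ distribution when conditioning on $Z_{*\ell}$. In the
following argument, we suppose everything is conditioned on $Z_{*\setminus j}$. Since $\|\epsilon_j\|_2^2/\sigma_j^2 \sim \chi^2_n$,
by Lemma 5, we know that, for any $0 < r_n < 1/2$,

\[ \mathbb{P}\Big(\max_{1\le j\le d} \frac{\|\epsilon_j\|_2^2}{n\sigma_j^2} < 1 - r_n\Big) \le d\exp\Big(-\frac{nr_n^2}{4}\Big). \tag{116} \]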
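Let $\Phi(\cdot)$ and $\phi(\cdot)$ be the cumulative distribution and density functions of a standard Gaussian random
variable. For any $0 < r_n < 1/2$, we have

\begin{align}
\mathbb{P}\Big(\max_{1\le j\le d} \frac{\|Z_{*\setminus j}^T\epsilon_j\|_\infty}{\|\epsilon_j\|_2} > \sqrt{2a\log d}\Big) \tag{117} \\
&\le d\,\mathbb{P}\Big(\|Z_{*\setminus j}^T\epsilon_j\|_\infty > \sqrt{1-r_n}\sqrt{2an\sigma_j^2\log d}\Big) + \mathbb{P}\Big(\max_{1\le j\le d} \frac{\|\epsilon_j\|_2^2}{n\sigma_j^2} < 1 - r_n\Big) \tag{118} \\
&\le d^2\,\mathbb{P}\Big(|Z_{*\ell}^T\epsilon_j| > \sqrt{1-r_n}\sqrt{2an\sigma_j^2\log d}\Big) + d\exp\Big(-\frac{nr_n^2}{4}\Big) \tag{119} \\
&\le 2d^2\Big(1 - \Phi\big(\sqrt{1-r_n}\sqrt{2a\log d}\big)\Big) + d\exp\Big(-\frac{nr_n^2}{4}\Big) \tag{120} \\
&\le 2d^2\,\frac{d^{-a(1-r_n)}}{\sqrt{2\pi}\sqrt{1-r_n}\sqrt{2a\log d}} + d\exp\Big(-\frac{nr_n^2}{4}\Big) \tag{121} \\
&= \frac{d^{\,2-a(1-r_n)}}{\sqrt{\pi(1-r_n)a\log d}} + d\exp\Big(-\frac{nr_n^2}{4}\Big), \tag{122}
\end{align}

where the second to last inequality follows from the fact that

\[ 1 - \Phi(t) \le \frac{\phi(t)}{t} = \frac{1}{\sqrt{2\pi}\,t}\exp\Big(-\frac{t^2}{2}\Big) \tag{123} \]

whenever $t \ge 1$. Now let

\[ r_n = 2\sqrt{\frac{(a-1)\log d}{n}}; \tag{124} \]

it can be seen that, when $n$ is large enough,

\begin{align}
\mathbb{P}\Big(\max_{1\le j\le d} \frac{\|Z_{*\setminus j}^T\epsilon_j\|_\infty}{\|\epsilon_j\|_2} \le \sqrt{2a\log d}\Big) \tag{125} \\
&\ge 1 - \sqrt{\frac{2}{\pi a\log d}}\cdot d^{\,2 - a\left(1 - 2\sqrt{\frac{(a-1)\log d}{n}}\right)} - \frac{1}{d^{a-2}}. \tag{126}
\end{align}

We finish the proof of this lemma. □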
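To get a feel for the size of the lower bound (112), one can evaluate it for concrete values. The sketch below is an illustrative calculation under assumed inputs ($n = 10000$, $d = 100$, $a = 3$); the function name `a4_probability_lower_bound` is hypothetical and the code only transcribes the formula, with $r_n$ taken from (124).

```python
import math

def a4_probability_lower_bound(n, d, a):
    """Evaluate the right-hand side of (112) for sample size n, dimension d and constant a > 2."""
    if a <= 2:
        raise ValueError("the lemma requires a > 2")
    r_n = 2.0 * math.sqrt((a - 1.0) * math.log(d) / n)   # r_n from (124)
    if r_n >= 0.5:
        raise ValueError("n is too small: the argument needs r_n < 1/2")
    first = math.sqrt(2.0 / (math.pi * a * math.log(d))) * d ** (2.0 - a * (1.0 - r_n))
    second = d ** (-(a - 2.0))                            # equals d * exp(-n * r_n^2 / 4)
    return 1.0 - first - second

bound = a4_probability_lower_bound(n=10_000, d=100, a=3.0)
```

For these inputs the bound is roughly 0.985, and it approaches 1 as $d$ grows whenever $a > 2$ and $\log d / n \to 0$.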
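The next lemma shows that $\hat\beta_j - \beta_j \in \mathcal{A}_{d-1}^{\bar c}(k)$. Thus the error vector falls in a restricted
set.

Lemma 11. Let $\hat\beta_j$ be defined as in (27) and $\bar c = \frac{c+1}{c-1}$. Then, on the event $\mathcal{A}_4$, we have
$\hat\beta_j - \beta_j \in \mathcal{A}_{d-1}^{\bar c}(k)$ for all $j = 1, \dots, d$.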


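Proof. Since $\hat\beta_j$ is the empirical minimizer of the objective function in (27), we have

\[ \frac{1}{\sqrt n}\|Z_{*j} - Z_{*\setminus j}\hat\beta_j\|_2 + \lambda\|\hat\beta_j\|_1 \le \frac{1}{\sqrt n}\|Z_{*j} - Z_{*\setminus j}\beta_j\|_2 + \lambda\|\beta_j\|_1. \tag{127} \]

Let $S := \{\ell : \beta_{j\ell} \neq 0\}$; it is obvious that $|S| \le k$. From (127), we get

\begin{align}
\frac{1}{\sqrt n}\|Z_{*j} - Z_{*\setminus j}\hat\beta_j\|_2 - \frac{1}{\sqrt n}\|Z_{*j} - Z_{*\setminus j}\beta_j\|_2 \tag{128} \\
&\le \lambda\|\beta_j\|_1 - \lambda\|\hat\beta_j\|_1 \tag{129} \\
&\le \lambda\|(\beta_j)_S\|_1 - \lambda\|(\hat\beta_j)_S\|_1 - \lambda\|(\hat\beta_j)_{S^c}\|_1 \tag{130} \\
&\le \lambda\big(\|(\hat\beta_j - \beta_j)_S\|_1 - \|(\hat\beta_j - \beta_j)_{S^c}\|_1\big). \tag{131}
\end{align}

On the event $\mathcal{A}_4$, we have $c\|\nabla Q_j(\beta_j)\|_\infty \le \lambda\sqrt{n}$, where

\[ Q_j(\beta_j) := \|Z_{*j} - Z_{*\setminus j}\beta_j\|_2. \tag{132} \]

Therefore, using the fact that $Q_j(\cdot)$ is a convex function,

\begin{align}
-\frac{\lambda}{c}\big(\|(\hat\beta_j - \beta_j)_S\|_1 + \|(\hat\beta_j - \beta_j)_{S^c}\|_1\big) \tag{133} \\
&= -\frac{\lambda}{c}\|\hat\beta_j - \beta_j\|_1 \tag{134} \\
&\le -\frac{1}{\sqrt n}\|\nabla Q_j(\beta_j)\|_\infty\|\hat\beta_j - \beta_j\|_1 \tag{135} \\
&\le \frac{1}{\sqrt n}\nabla Q_j(\beta_j)^T(\hat\beta_j - \beta_j) \tag{136} \\
&\le \frac{1}{\sqrt n}\big(Q_j(\hat\beta_j) - Q_j(\beta_j)\big). \tag{137}
\end{align}

By combining the above analysis, we have

\[ -\big(\|(\hat\beta_j - \beta_j)_S\|_1 + \|(\hat\beta_j - \beta_j)_{S^c}\|_1\big) \le c\big(\|(\hat\beta_j - \beta_j)_S\|_1 - \|(\hat\beta_j - \beta_j)_{S^c}\|_1\big). \tag{139} \]

Therefore,

\[ \|(\hat\beta_j - \beta_j)_{S^c}\|_1 \le \frac{c+1}{c-1}\|(\hat\beta_j - \beta_j)_S\|_1 = \bar c\,\|(\hat\beta_j - \beta_j)_S\|_1. \tag{140} \]

We finish the proof of this lemma. □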
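The cone condition in (140) is simple to state in code. The following sketch, with hypothetical helper names `cone_constant` and `in_cone`, computes $\bar c = (c+1)/(c-1)$ and tests whether a given error vector satisfies $\|\delta_{S^c}\|_1 \le \bar c\,\|\delta_S\|_1$ with $\delta \neq 0$, as in definition (58); the example vectors are made up for illustration.

```python
import numpy as np

def cone_constant(c):
    """Return c_bar = (c + 1) / (c - 1), which requires c > 1."""
    if c <= 1:
        raise ValueError("c must be greater than 1")
    return (c + 1.0) / (c - 1.0)

def in_cone(delta, support, c_bar):
    """Check ||delta_{S^c}||_1 <= c_bar * ||delta_S||_1 and delta != 0 for S = support."""
    delta = np.asarray(delta, dtype=float)
    mask = np.zeros(delta.shape[0], dtype=bool)
    mask[list(support)] = True
    on_support = np.sum(np.abs(delta[mask]))
    off_support = np.sum(np.abs(delta[~mask]))
    return bool(np.any(delta != 0) and off_support <= c_bar * on_support)

c_bar = cone_constant(3.0)                        # (3 + 1) / (3 - 1) = 2
inside = in_cone([1.0, 0.5, 0.5], [0], c_bar)     # off-support mass 1.0 <= 2 * 1.0
outside = in_cone([1.0, 2.0, 1.0], [0], c_bar)    # off-support mass 3.0 >  2 * 1.0
```

Note that $\bar c$ grows without bound as $c \to 1$, so a larger $c$ in the choice of $\lambda$ gives a narrower cone.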

B Proof of Matrix Restricted Eigenvalue Conditions 

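The following lemma bounds the restricted eigenvalue of the sample correlation matrix $\hat R$.

Lemma 12. Let the event $\mathcal{A}_3$ be defined as in Lemma 9. We assume $k\log d = o(n)$ and
define an event

\[ \mathcal{B}_1 := \Big\{ \inf_{\beta \in \mathcal{A}_d^{\bar c}(k)} \frac{\sqrt{k\beta^T\hat R\beta}}{\|\beta\|_1} \ge \frac{1}{5(1+\bar c)\xi_{\max}^{1/2}} \Big\}. \tag{141} \]

Then, there exist constants $c_1$ and $c_2$ such that $\mathbb{P}(\mathcal{B}_1 \mid \mathcal{A}_3) \ge 1 - c_1\exp(-c_2 n)$.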
Proof. For any $S \subset \{1, 2, \dots, d\}$ with $|S| \le k$, we have, for any $\beta \in \mathcal{A}_d^{\bar c}(S)$,

\[ \|\beta\|_1 \le (1+\bar c)\|\beta_S\|_1 \le (1+\bar c)\sqrt{k}\|\beta_S\|_2 \le (1+\bar c)\sqrt{k}\|\beta\|_2, \tag{142} \]

and

\[ \beta^T\Sigma\beta \ge \Lambda_{\min}(\Sigma)\|\beta\|_2^2 \ge \Lambda_{\min}(\Sigma)\|\beta_S\|_2^2 \ge \Lambda_{\min}(\Sigma)\frac{\|\beta\|_1^2}{k(1+\bar c)^2}, \tag{143} \]

where the last inequality uses the fact that $\|\beta\|_1 \le (1+\bar c)\sqrt{k}\|\beta_S\|_2$. This result is obtained
from (142).

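Recall that $X \in \mathbb{R}^{n\times d}$ is a matrix whose rows are independent $N(0, \Sigma)$ Gaussian
random vectors, where $\Sigma$ is the $d \times d$ covariance matrix. Let $\hat\Sigma$ be the sample covariance
matrix of $X$. From Theorem 1 of Raskutti et al. (2010), we know that there exist two
positive constants $c_1$ and $c_2$ such that

\[ \mathbb{P}\Big(\sqrt{\beta^T\hat\Sigma\beta} \ge \frac{1}{4}\sqrt{\beta^T\Sigma\beta} - 9\max_{1\le j\le d}\sqrt{\Sigma_{jj}}\sqrt{\frac{\log d}{n}}\|\beta\|_1,\ \forall\beta \in \mathbb{R}^d\Big) \ge 1 - c_1\exp(-c_2 n). \tag{144} \]

Let $\hat\Gamma := \mathrm{diag}(\hat\Sigma)$; we have, for any $\beta \in \mathbb{R}^d$,

\[ \mathbb{P}\Big(\sqrt{(\hat\Gamma^{-1/2}\beta)^T\hat\Sigma(\hat\Gamma^{-1/2}\beta)} \ge \frac{1}{4}\sqrt{(\hat\Gamma^{-1/2}\beta)^T\Sigma(\hat\Gamma^{-1/2}\beta)} - 9\max_{1\le j\le d}\sqrt{\Sigma_{jj}}\sqrt{\frac{\log d}{n}}\|\hat\Gamma^{-1/2}\beta\|_1\Big) \ge 1 - c_1\exp(-c_2 n). \tag{145} \]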

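Since $\hat\Sigma = \hat\Gamma^{1/2}\hat R\hat\Gamma^{1/2}$, we have

\[ (\hat\Gamma^{-1/2}\beta)^T\hat\Sigma(\hat\Gamma^{-1/2}\beta) = \beta^T\hat R\beta. \tag{146} \]

On the event $\mathcal{A}_3$ as defined in Lemma 9, we have

\[ \frac{1}{2}\Lambda_{\min}(\Sigma) \le \min_{1\le j\le d}\hat\Sigma_{jj} \le \max_{1\le j\le d}\hat\Sigma_{jj} \le \frac{3}{2}\Lambda_{\max}(\Sigma). \tag{147} \]

It is easy to see that

\[ \|\hat\Gamma^{-1/2}\beta\|_1 \le \|\beta\|_1\Big(\min_{1\le j\le d}\hat\Sigma_{jj}\Big)^{-1/2} \le \sqrt{\frac{2}{\Lambda_{\min}(\Sigma)}}\|\beta\|_1, \tag{148} \]

\[ \|\hat\Gamma^{-1/2}\beta\|_2 \ge \|\beta\|_2\Big(\max_{1\le j\le d}\hat\Sigma_{jj}\Big)^{-1/2} \ge \sqrt{\frac{2}{3\Lambda_{\max}(\Sigma)}}\|\beta\|_2. \tag{149} \]

Therefore,

\[ \sqrt{(\hat\Gamma^{-1/2}\beta)^T\Sigma(\hat\Gamma^{-1/2}\beta)} \ge \sqrt{\Lambda_{\min}(\Sigma)}\|\hat\Gamma^{-1/2}\beta\|_2 \ge \sqrt{\frac{2\Lambda_{\min}(\Sigma)}{3\Lambda_{\max}(\Sigma)}}\|\beta\|_2. \tag{150} \]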
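Plugging (146), (148) and (150) into (145), we have

\[ \mathbb{P}\Big(\sqrt{\beta^T\hat R\beta} \ge \frac{1}{4}\sqrt{\frac{2\Lambda_{\min}(\Sigma)}{3\Lambda_{\max}(\Sigma)}}\|\beta\|_2 - 9\max_{1\le j\le d}\sqrt{\Sigma_{jj}}\sqrt{\frac{2\log d}{\Lambda_{\min}(\Sigma)\,n}}\|\beta\|_1\Big) \ge 1 - c_1\exp(-c_2 n). \tag{151} \]

By (142), for any $\beta \in \mathcal{A}_d^{\bar c}(k)$, we have

\[ \|\beta\|_2 \ge \frac{1}{(1+\bar c)\sqrt{k}}\|\beta\|_1. \tag{152} \]

Therefore,

\[ \mathbb{P}\Big(\inf_{\beta \in \mathcal{A}_d^{\bar c}(k)} \frac{\sqrt{k\beta^T\hat R\beta}}{\|\beta\|_1} \ge \frac{1}{4(1+\bar c)}\sqrt{\frac{2\Lambda_{\min}(\Sigma)}{3\Lambda_{\max}(\Sigma)}} - 9\max_{1\le j\le d}\sqrt{\Sigma_{jj}}\sqrt{\frac{2k\log d}{\Lambda_{\min}(\Sigma)\,n}}\Big) \ge 1 - c_1\exp(-c_2 n). \]

Since we assume $k\log d = o(n)$, for $n$ large enough, we have

\[ \frac{1}{4(1+\bar c)}\sqrt{\frac{2\Lambda_{\min}(\Sigma)}{3\Lambda_{\max}(\Sigma)}} - 9\max_{1\le j\le d}\sqrt{\Sigma_{jj}}\sqrt{\frac{2k\log d}{\Lambda_{\min}(\Sigma)\,n}} \ge \frac{1}{5(1+\bar c)\xi_{\max}^{1/2}}. \tag{153} \]

We finish the proof of this lemma. □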
We can see that when $k\log d = o(n)$ and $n$ is large enough, with high probability,

\[ \kappa(\hat R) := \inf_{\beta \in \mathcal{A}_d^{\bar c}(k)} \frac{\sqrt{k\beta^T\hat R\beta}}{\|\beta\|_1} \ge \frac{1}{5(1+\bar c)\xi_{\max}^{1/2}} \tag{154} \]

is bounded away from zero by a positive constant. This implies the restricted eigenvalue
condition required by the SQRT-Lasso method, which is provided in the next lemma.

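Lemma 13. We define the event

\[ \mathcal{B}_2 := \Big\{ \max_{1\le j\le d} \frac{\|Z_{*\setminus j}(\hat\beta_j - \beta_j)\|_2}{\tau_j} \le C\cdot\sqrt{k\log d} \Big\}. \tag{155} \]

Then $\mathbb{P}(\mathcal{B}_2 \mid \mathcal{B}_1) \ge 1 - 1/d$.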

Proof. We only need to verify that the $L_1$-restricted eigenvalue condition of the SQRT-Lasso
is satisfied. More specifically, we need to verify the conditions in Theorem 1 of Belloni et al.
(2012). Since $\hat R = \frac{1}{n}Z^TZ$, where $Z$ is defined as in Section 3, we have proved that, on $\mathcal{B}_1$,

\[ \inf_{\beta \in \mathcal{A}_d^{\bar c}(k)} \frac{\sqrt{k}\|Z\beta\|_2}{\sqrt{n}\|\beta\|_1} = \inf_{\beta \in \mathcal{A}_d^{\bar c}(k)} \frac{\sqrt{k\beta^T\hat R\beta}}{\|\beta\|_1} \ge \frac{1}{5(1+\bar c)\xi_{\max}^{1/2}}. \tag{156} \]

Recall that $Z_{*\setminus j}$ is the $n \times (d-1)$ submatrix of $Z$ with the $j$th column removed, then

\[ \min_{1\le j\le d}\ \inf_{\beta \in \mathcal{A}_{d-1}^{\bar c}(k)} \frac{\sqrt{k}\|Z_{*\setminus j}\beta\|_2}{\sqrt{n}\|\beta\|_1} \ge \inf_{\beta \in \mathcal{A}_d^{\bar c}(k)} \frac{\sqrt{k}\|Z\beta\|_2}{\sqrt{n}\|\beta\|_1} \ge \frac{1}{5(1+\bar c)\xi_{\max}^{1/2}}. \tag{157} \]

Therefore, on the event $\mathcal{B}_1$, the $L_1$-restricted eigenvalue condition holds for all the SQRT-Lasso
subproblems defined in (27). The desired result follows from Theorem 1 of Belloni
et al. (2012). □



C Main Lemmas 

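We define the event $\mathcal{E} := \mathcal{A}_1 \cap \mathcal{A}_2 \cap \mathcal{A}_3 \cap \mathcal{A}_4 \cap \mathcal{B}_1 \cap \mathcal{B}_2$. It is easy to see that $\mathbb{P}(\mathcal{E}) \ge 1 - o(1)$.
To prove the main results, we separately analyze the diagonal and off-diagonal
elements of $\hat\Theta - \Theta$. In the next subsection, we first control the diagonal elements.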

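C.1 Analyzing the Diagonal Elements
Lemma 14. On the event $\mathcal{E}$, we have

\[ \max_{1\le j\le d} |\hat\Theta_{jj} - \Theta_{jj}| \le C\cdot\|\Theta\|_2\sqrt{\frac{\log d}{n}} \]

for large enough $n$.

Proof. Recall that $Z_{*j} = Z_{*\setminus j}\beta_j + \hat\Gamma_{jj}^{-1/2}\epsilon_j$; we have

\begin{align*}
\big|(\hat\Theta_{jj})^{-1} - (\Theta_{jj})^{-1}\big|
&= \big|\hat\Gamma_{jj}\hat\tau_j^2 - \hat\Gamma_{jj}\tau_j^2\big| \\
&= \Big|\hat\Gamma_{jj}\frac{\|Z_{*j} - Z_{*\setminus j}\hat\beta_j\|_2^2}{n} - \sigma_j^2\Big| \\
&= \Big|\hat\Gamma_{jj}\frac{\|Z_{*\setminus j}(\beta_j - \hat\beta_j) + \hat\Gamma_{jj}^{-1/2}\epsilon_j\|_2^2}{n} - \sigma_j^2\Big| \\
&\le \Big|\frac{\|\epsilon_j\|_2^2}{n} - \sigma_j^2\Big| + \hat\Gamma_{jj}\frac{\|Z_{*\setminus j}(\hat\beta_j - \beta_j)\|_2^2}{n} + 2\hat\Gamma_{jj}^{1/2}\frac{|(\hat\beta_j - \beta_j)^TZ_{*\setminus j}^T\epsilon_j|}{n}.
\end{align*}

On the event $\mathcal{E}$, we have

\[ \frac{\|\epsilon_j\|_2^2}{n} \le 1.4\sigma_j^2 \quad \text{and} \quad \Big|\frac{\|\epsilon_j\|_2^2}{n} - \sigma_j^2\Big| \le 3.5\sigma_j^2\sqrt{\frac{\log d}{n}}, \]

\[ \hat\Gamma_{jj}^{1/2} \le \max_{1\le j\le d}(\hat\Sigma_{jj})^{1/2} \le \sqrt{\frac{3\Lambda_{\max}(\Sigma)}{2}} \quad \text{and} \quad \hat\Gamma_{jj}^{-1/2} \le \Big(\min_{1\le j\le d}\hat\Sigma_{jj}\Big)^{-1/2} \le \sqrt{\frac{2}{\Lambda_{\min}(\Sigma)}}. \]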
(166) 
(167) 
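Furthermore, since $\tau_j^2 = \sigma_j^2\hat\Gamma_{jj}^{-1}$, we have

\[ \hat\Gamma_{jj}\|Z_{*\setminus j}(\hat\beta_j - \beta_j)\|_2^2 \le C^2\cdot\sigma_j^2\cdot k\log d. \tag{168} \]

We also have

\[ 2\hat\Gamma_{jj}^{1/2}\big|(\hat\beta_j - \beta_j)^TZ_{*\setminus j}^T\epsilon_j\big| \le 2\hat\Gamma_{jj}^{1/2}\|\hat\beta_j - \beta_j\|_1\cdot\|Z_{*\setminus j}^T\epsilon_j\|_\infty. \tag{169} \]

Since $\hat\beta_j - \beta_j \in \mathcal{A}_{d-1}^{\bar c}(k)$, by the definition of the $L_1$-restricted eigenvalue, we know that

\begin{align}
\hat\Gamma_{jj}^{1/2}\|\hat\beta_j - \beta_j\|_1
&\le \hat\Gamma_{jj}^{1/2}\cdot 5(1+\bar c)\xi_{\max}^{1/2}\cdot\sqrt{k(\hat\beta_j - \beta_j)^T\hat R_{\setminus j,\setminus j}(\hat\beta_j - \beta_j)} \tag{170} \\
&\le \hat\Gamma_{jj}^{1/2}\cdot 5(1+\bar c)\xi_{\max}^{1/2}\cdot\sqrt{\frac{k}{n}}\cdot\|Z_{*\setminus j}(\hat\beta_j - \beta_j)\|_2 \tag{171} \\
&\le \hat\Gamma_{jj}^{1/2}\cdot 5(1+\bar c)\xi_{\max}^{1/2}\cdot\sqrt{\frac{k}{n}}\cdot C\cdot\sigma_j\hat\Gamma_{jj}^{-1/2}\sqrt{k\log d} \tag{172} \\
&\le 5(1+\bar c)\cdot C\cdot\xi_{\max}^{1/2}\cdot\sigma_j\cdot k\sqrt{\frac{\log d}{n}}. \tag{173}
\end{align}

Similarly, on the event $\mathcal{E}$, we have

\begin{align}
\|Z_{*\setminus j}^T\epsilon_j\|_\infty &\le \frac{\lambda\sqrt{n}}{c}\|\epsilon_j\|_2 \tag{174} \\
&\le \sigma_j\sqrt{2.8an\log d}. \tag{175}
\end{align}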



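Therefore, on $\mathcal{E}$,

\begin{align}
\big|(\hat\Theta_{jj})^{-1} - (\Theta_{jj})^{-1}\big| \tag{176} \\
&\le 3.5\sigma_j^2\sqrt{\frac{\log d}{n}} + C^2\sigma_j^2\cdot\frac{k\log d}{n} + 10\sqrt{2.8a}\cdot(1+\bar c)\cdot C\cdot\xi_{\max}^{1/2}\sigma_j^2\cdot\frac{k\log d}{n}. \tag{177}
\end{align}

Since $k\sqrt{\frac{\log d}{n}} = o(1)$, there exists a constant $C$ such that, for large enough $n$,

\[ \big|(\hat\Theta_{jj})^{-1} - (\Theta_{jj})^{-1}\big| \le C\sigma_j^2\sqrt{\frac{\log d}{n}}. \tag{178} \]

Multiplying by $\Theta_{jj}$ on both sides and using the fact that $\Theta_{jj} = \sigma_j^{-2}$, we have

\[ \Big|\frac{\Theta_{jj}}{\hat\Theta_{jj}} - 1\Big| \le C\sqrt{\frac{\log d}{n}}. \tag{179} \]

This implies that

\[ \Big(1 + C\sqrt{\frac{\log d}{n}}\Big)^{-1} \le \frac{\hat\Theta_{jj}}{\Theta_{jj}} \le \Big(1 - C\sqrt{\frac{\log d}{n}}\Big)^{-1}. \tag{180} \]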
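Since, for large enough $n$,

\[ 1 - C\sqrt{\frac{\log d}{n}} \le \Big(1 + C\sqrt{\frac{\log d}{n}}\Big)^{-1} \quad \text{and} \quad \Big(1 - C\sqrt{\frac{\log d}{n}}\Big)^{-1} \le 1 + 2C\sqrt{\frac{\log d}{n}}, \]

we have

\[ 1 - C\sqrt{\frac{\log d}{n}} \le \frac{\hat\Theta_{jj}}{\Theta_{jj}} \le 1 + 2C\sqrt{\frac{\log d}{n}}. \]

This implies that

\[ \max_{1\le j\le d} |\hat\Theta_{jj} - \Theta_{jj}| \le C\cdot\|\Theta\|_2\sqrt{\frac{\log d}{n}}. \]

The last inequality follows from the fact that $\max_{1\le j\le d}\Theta_{jj} \le \|\Theta\|_2$.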

C.2 Analyzing the Off-diagonal Elements in $L_1$-norm Error

We first bound the $L_1$-norm of each column of the off-diagonal elements of $\hat\Theta - \Theta$.
Lemma 15. On the event $\mathcal{E}$, we have



max^ll^- - < 5v / 2C(l + c).£ max /e 



'log 


d 


n 


'log 


d 



n 



for large enough n. 
Proof. We recall that 







\3,3 ~ KJ 03 L \j,\j L jj Pj- 



Since r- 2 := a - 2 Tjj = SjjTjj, we have 



®wlli 



< 



< 



? -2f-l/2p-l/2a. _ T -2f-V2f-l/2 fl M 
T 3 L 33 L \j,\j P 3 T j L jj L \j,\j Pj\\l 

K ! \^33 2 (®^- & Ml 



^llll^f ®«l Pi - ^lll + f 33 Z \®n - ®33 



1/2 I 



\3,\3^Wl 



1/2 1 1 iSl/2 



AiAiHil jj 

^-1/2 1| |f,l/2 



i|rJr©«|||ft--^|| 1 + 











- 1 



l wllr 
On the event $\mathcal{E}$, using (179) and the analysis of Lemma 14, we have, for large enough $n$,



|r. 1 { 2 \\-, < max f ,, 1/2 = max (£-■-■) 1/2 < J , 2 , - , (193) 



t){ 2 <W^A max (£), (194) 



e,,< (1 + cJ 1 ^ 0^<2||0|| 2 , (195) 



n 









<Cd***. (196) 



n 



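By Lemma 11, $\hat\beta_j - \beta_j \in \mathcal{A}_{d-1}^{\bar c}(k)$. By the definition of the $L_1$-restricted eigenvalue, we
know that, on the event $\mathcal{B}_2$,

\begin{align}
\|\hat\beta_j - \beta_j\|_1 &\le 5(1+\bar c)\xi_{\max}^{1/2}\cdot\sqrt{k(\hat\beta_j - \beta_j)^T\hat R_{\setminus j,\setminus j}(\hat\beta_j - \beta_j)} \tag{197} \\
&\le 5(1+\bar c)\xi_{\max}^{1/2}\cdot\sqrt{\frac{k}{n}}\cdot\|Z_{*\setminus j}(\hat\beta_j - \beta_j)\|_2 \tag{198} \\
&\le 5(1+\bar c)\xi_{\max}^{1/2}\cdot\sqrt{\frac{k}{n}}\cdot C\cdot\sigma_j\hat\Gamma_{jj}^{-1/2}\sqrt{k\log d} \tag{199} \\
&= 5(1+\bar c)\cdot C\cdot\xi_{\max}^{1/2}\cdot\sigma_j\cdot\hat\Gamma_{jj}^{-1/2}\cdot k\sqrt{\frac{\log d}{n}} \tag{200} \\
&\le 5\sqrt{2}C(1+\bar c)\xi_{\max}k\sqrt{\frac{\log d}{n}}, \tag{201}
\end{align}

where the last inequality follows from the fact that

\[ \sigma_j\hat\Gamma_{jj}^{-1/2} \le \sqrt{2\xi_{\max}}. \tag{202} \]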



Therefore 

||f -^HJr }f e^-IH^ - < 10(1 + c)V3W • c • a,||0|| 2 • f/ 2 . k J^ (203) 



and 







33 ^ 



I©wlli<^-Il©lli\/^- (204) 



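Combining all the above analysis, we have

\[ \|\hat\Theta_{\setminus j,j} - \Theta_{\setminus j,j}\|_1 \le C\|\Theta\|_2\,k\sqrt{\frac{\log d}{n}} + C\cdot\|\Theta\|_1\sqrt{\frac{\log d}{n}}. \tag{205} \]

We thus prove the desired result of the lemma. □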
C.3 Analyzing the Off-diagonal Elements in Sup-norm Error

To conduct the sup-norm analysis of each column of the off-diagonal elements, we first
present a technical lemma:

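Lemma 16. On the event $\mathcal{E}$, there exists a constant $C$ such that

\[ \Big\|\frac{\hat\Gamma_{jj}^{1/2}}{n}Z_{*\setminus j}^T\big(Z_{*\setminus j}\hat\beta_j - Z_{*\setminus j}\beta_j\big)\Big\|_\infty \le C\sigma_j\sqrt{\frac{\log d}{n}} \quad \text{for all } j = 1, \dots, d. \tag{206} \]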



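Proof. Recall that we define

\[ Q_j(\gamma) = \|Z_{*j} - Z_{*\setminus j}\gamma\|_2. \tag{207} \]

Let $\lambda = c\sqrt{\frac{2a\log d}{n}}$ with $c > 1$ and $a > 2$. Since $\hat\beta_j$ is defined as

\[ \hat\beta_j := \arg\min_{\gamma}\Big\{\frac{1}{\sqrt n}Q_j(\gamma) + \lambda\|\gamma\|_1\Big\}, \tag{208} \]

we know that $c\|\nabla Q_j(\hat\beta_j)\|_\infty \le \sqrt{n}\lambda$. This means

\[ \|Z_{*\setminus j}^T(Z_{*j} - Z_{*\setminus j}\hat\beta_j)\|_\infty \le \sqrt{2a\log d}\,\|Z_{*j} - Z_{*\setminus j}\hat\beta_j\|_2. \tag{209} \]

Using the fact that $Z_{*j} = Z_{*\setminus j}\beta_j + \hat\Gamma_{jj}^{-1/2}\epsilon_j$, we have

\[ \|Z_{*\setminus j}^T(Z_{*\setminus j}(\beta_j - \hat\beta_j) + \hat\Gamma_{jj}^{-1/2}\epsilon_j)\|_\infty \le \sqrt{2a\log d}\,\|Z_{*\setminus j}(\beta_j - \hat\beta_j) + \hat\Gamma_{jj}^{-1/2}\epsilon_j\|_2. \tag{210} \]

Therefore

\begin{align}
\|Z_{*\setminus j}^TZ_{*\setminus j}(\hat\beta_j - \beta_j)\|_\infty \tag{211} \\
&\le \sqrt{2a\log d}\,\|Z_{*\setminus j}(\hat\beta_j - \beta_j)\|_2 + \sqrt{2a\log d}\,\hat\Gamma_{jj}^{-1/2}\|\epsilon_j\|_2 + \hat\Gamma_{jj}^{-1/2}\|Z_{*\setminus j}^T\epsilon_j\|_\infty. \tag{212}
\end{align}

On the event $\mathcal{E}$, we have

\begin{align}
\|\epsilon_j\|_2 &\le \sigma_j\cdot\sqrt{1.4n}, \tag{213} \\
\|Z_{*\setminus j}^T\epsilon_j\|_\infty &\le \sqrt{2a\log d}\,\|\epsilon_j\|_2 \le \sigma_j\cdot\sqrt{2.8an\log d}, \tag{214} \\
\|Z_{*\setminus j}(\hat\beta_j - \beta_j)\|_2 &\le C\tau_j\sqrt{k\log d} \le C\cdot\sigma_j\cdot\hat\Gamma_{jj}^{-1/2}\sqrt{k\log d}. \tag{215}
\end{align}

By piecing all these terms together, we have

\begin{align}
\|Z_{*\setminus j}^TZ_{*\setminus j}(\hat\beta_j - \beta_j)\|_\infty \tag{216} \\
&\le C\cdot\sigma_j\hat\Gamma_{jj}^{-1/2}\sqrt{2ak}\cdot\log d + 2\sigma_j\cdot\hat\Gamma_{jj}^{-1/2}\sqrt{2.8an\log d}. \tag{217}
\end{align}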
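Therefore

\[ \Big\|\frac{\hat\Gamma_{jj}^{1/2}}{n}Z_{*\setminus j}^TZ_{*\setminus j}(\hat\beta_j - \beta_j)\Big\|_\infty \le C\cdot\sigma_j\Big(\frac{\sqrt{k}\log d}{n} + \sqrt{\frac{\log d}{n}}\Big). \tag{218} \]

The result follows from the fact that

\[ \frac{\sqrt{k}\log d}{n} \le \sqrt{\frac{\log d}{n}} \]

for large enough $n$. We finish the proof of this lemma. □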



Lemma 17. On the event $\mathcal{E}$, we have

\[ \max_{1\le j\le d} \|\hat\Theta_{\setminus j,j} - \Theta_{\setminus j,j}\|_\infty \le C\|\Theta\|_1\sqrt{\frac{\log d}{n}} \]

for large enough $n$.

Proof. We have the following decomposition
? -2f-l/2f-l/2a. _ -2f-l/2ft-l/2 fl ||



33 

33 ^in f uv(^ " &)IL + r.;f|e^ - ©«|||r^(;ft-| 



f^ 2 © 



^-1/2/ 



V-l/2. 



f 1/2 



-1/2 



r)f^ll|fu(;(/3,-/3,)L + 



(^-^OIL + |0^-0^II|0\,,,07/ 



j j 1 1 00 
Following a similar argument as in Lemma 15, we have, on the event $\mathcal{E}$,







33 







- 1 



33 



|0\jjlloo < c ■ 11©^ 



logd 



77, 



Now for the first term in (225), we have 



|pl/2|i|J-l/2 



= II f ~Y 2 Rr.\ .R\ 



U/2 

\3,\3 n '\3,\3 n -\i>\i L 33 



? (P3-P3 



^ llf-\ /2 Rr.\. 



R> 



pl/2 
\3\3 L jj 



< 



-1/2, 



r; 



-1 



(ft - & 

U/2, 



1/2/ 



To analyze the term ||R\j ^T^ (/3j — /3j ) 1 1 ^ ? we have 



Ruvrjf^-^) 



< 



< 



fR 



1 n Z *\3 Z *\3) f )3 2 ^ ~ ftOlloc + ll^^f^ " 



(219) 



R> 



U/2 



Z *\j Z *\j(^i ~~ ^ 



(220) 
(221) 

(222) 

(223) 

(224) 

(225) 



(226) 

(227) 
(228) 
(229) 
(230) 

(231) 
(232) 

. (233) 
From (185), we have, on the event $\mathcal{E}$:

\[ \|\hat\beta_j - \beta_j\|_1 \le 5\sqrt{2}C(1+\bar c)\xi_{\max}k\sqrt{\frac{\log d}{n}}. \]

By Lemma 7, we have, on the event $\mathcal{E}$:

\[ \Big\|R_{\setminus j,\setminus j} - \frac{1}{n}Z_{*\setminus j}^TZ_{*\setminus j}\Big\|_{\max} \le \|\hat R - R\|_{\max} \le 18\sqrt{\frac{\log d}{n}}. \]

On the other hand, let $A$ be an invertible matrix, we have



which implies that 



|AA _1 || < llAll llA -1 !! , 

Moo — II lloo II Moo' 



< IIA" 



Therefore, 



IR 



mm 



Pile 



< 



Moo 



mm 



/3^0,/3eK d ||R/3||c 

< ll R_1 H 

— II 1 1 oo 

< max 5LJI0IL 

l<j<d 11 111 

< A max (S)||0|| r 



From Lemma 16, we have 



m/2 

" 33 rjT 



n 



< Co, 



logd 



n 



Therefore, by piecing all these terms together: 



< r 



1/2 
33 



18^.5^(1 + 0).^^ + ^. 



n 



logd 



n 
Again, on the event $\mathcal{E}$, we have



Tjj < ^Amax(S), (247) 



-1/2 1. 

mm^dfaj) 1 ' 2 ~ V A min(S)' 



K{j\L < T^TTT, < i/^-^, (248) 



S j3 <[l + Cj 1 ^ 0,,< 211011, = ^, (249) 



1 1 



^ - "7= < . = VAmax(S). (250) 

V J J V ^min ( " J 

Putting all the terms in (242), (246), (248), (249) together, we get, for large enough n: 

l f jfe«ll|r^(ft-ft)L (251) 

£ ^j^Wlr Vs WE) + CvQS ')/^) (253) 

• 1 ^ + 2^j£. .||e|| 1 . ) /S? 



< 2V^^.c.||e|| 1 .^ + 2V2de-c--||e|| 1 .J^ ( 254) 



< 2(73 + 72)4^. C-Hell,.^, (255) 



where the last inequality follows from the fact that $k\sqrt{\log d} \le \sqrt{n}$ for large enough $n$.
Therefore, the lemma is proved. □

D Proof of the Main Theorems 

In this section, we prove the main results based on the previous technical lemmas. 
D.1 Proof of Theorem 2

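Proof. By piecing together the results of Lemma 14 and Lemma 15, we have

\begin{align}
\|\hat\Theta - \Theta\|_1 &= \max_{1\le j\le d}\|\hat\Theta_{*j} - \Theta_{*j}\|_1 \tag{256} \\
&\le \max_{1\le j\le d}|\hat\Theta_{jj} - \Theta_{jj}| + \max_{1\le j\le d}\|\hat\Theta_{\setminus j,j} - \Theta_{\setminus j,j}\|_1 \tag{257} \\
&\le C\|\Theta\|_2\sqrt{\frac{\log d}{n}} + C\big(\|\Theta\|_2\,k + \|\Theta\|_1\big)\sqrt{\frac{\log d}{n}} \tag{258} \\
&\le C\big(\|\Theta\|_2\,k + \|\Theta\|_1\big)\sqrt{\frac{\log d}{n}} \tag{259} \\
&\le C\Big(k\|\Theta\|_2\sqrt{\frac{\log d}{n}}\Big). \tag{260}
\end{align}

The last inequality follows from the fact that

\[ \|\Theta\|_1 \le k\|\Theta\|_{\max} \le k\|\Theta\|_2. \tag{261} \]

We complete the proof of this theorem. □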
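D.2 Proof of Theorem 3

Proof. By piecing together the results of Lemma 14 and Lemma 17, we have

\[ \|\hat\Theta - \Theta\|_{\max} \le \max_{1\le j\le d}|\hat\Theta_{jj} - \Theta_{jj}| + \max_{1\le j\le d}\|\hat\Theta_{\setminus j,j} - \Theta_{\setminus j,j}\|_\infty \le C\|\Theta\|_1\sqrt{\frac{\log d}{n}}. \tag{262} \]

We complete the proof of this theorem. □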